Deep Learning Final Sheet

A single neuron, or perceptron, performs a weighted sum of its input values plus a bias term, and applies an activation function to output a value, representing different functions depending on the choice of activation. It is the basic computational unit of neural networks, with the input features represented by nodes connecting to weights that feed into the output neuron. Adding a constant input node allows the bias term to be absorbed into the weight vector for a graphical representation of the computation.

Deep Learning

Introduction
Learning goals
Relationship of DL and ML
Concept of representation or feature learning
Use-cases and data types for DL methods

Figure: A neural network maps an input image through several layers ("some magic") to the prediction "Oh, it's a bird!" or "No bird!".
WHAT IS DEEP LEARNING

Figure: Deep learning as a subset of machine learning, which in turn is a subset of artificial intelligence.

Deep learning is a subfield of ML based on artificial neural
networks.

Deep Learning – 1 / 12
DEEP LEARNING AND NEURAL NETWORKS

Deep learning itself is not new:


Neural networks have been around since the 70s.
Deep neural networks, i.e., networks with multiple hidden
layers, are not much younger.

Why everybody is talking about deep learning now:


1 Specialized, powerful hardware allows training of huge neural
networks to push the state of the art on difficult problems.
2 Large amounts of data are available.
3 Special network architectures exist for image/text data.
4 Better optimization and regularization strategies.

Deep Learning – 2 / 12
IMAGE CLASSIFICATION WITH NEURAL
NETWORKS
“Machine learning algorithms, inspired by the brain, based on
learning multiple levels of representation/abstraction.”
Y. Bengio

Figure: A neural network maps the raw pixels x1, x2, x3, ... through several layers ("some magic") to the prediction "Oh, it's a bird!" or "No bird!".
Deep Learning – 3 / 12
POSSIBLE USE-CASES
Deep learning can be extremely valuable if the data has these
properties:

It is high dimensional.
Each single feature by itself is not very informative; only a
combination of features might be.
There is a large amount of training data.

This implies that for tabular data, deep learning is rarely the
correct model choice.

Without extensive tuning, models like random forests or gradient


boosting will outperform deep learning most of the time.
One exception is data with categorical features with many levels.

Deep Learning – 4 / 12
POSSIBLE USE-CASE: IMAGES
High Dimensional: A color image with 255 × 255 pixels and 3 color
channels already has 195,075 features.
Informative: A single pixel is not meaningful in itself.
Training Data: Depending on the application, huge amounts of data
are available.

Architecture: Convolutional Neural Networks (CNN)

Deep Learning – 5 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Alex Krizhevsky (2009)

Image classification tries to predict a single label for each image.


CIFAR-10 is a well-known dataset used for image classification. It consists of 60,000
32×32 color images containing one of 10 object classes, with 6,000 images per class.

Deep Learning – 6 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Kaiming He (2017)

Object detection: Mask R-CNN is a general framework for instance segmentation
that efficiently detects objects in an image while simultaneously generating a
high-quality segmentation mask for each instance.
Deep Learning – 7 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Hyeonwoo Noh (2015)

Image segmentation partitions the image into (multiple) segments.

Deep Learning – 8 / 12
POSSIBLE USE-CASE: TEXT
High Dimensional: Each word can be a single feature (roughly 300,000
words in the German language).
Informative: A single word does not provide much context.
Training Data: Huge amounts of text data available.

Architecture: Recurrent Neural Networks (RNN)

Deep Learning – 9 / 12
POSSIBLE USE-CASE: TEXT CLASSIFICATION

Sentiment Analysis is the application of natural language processing


to systematically identify the emotional and subjective information in
texts.

Deep Learning – 10 / 12
POSSIBLE USE-CASE: TEXT

Machine translation (e.g., Google Translate): Neural machine translation
uses neural networks to predict the likelihood of a sequence of
words, typically modeling entire sentences in a single integrated model.

Deep Learning – 11 / 12
APPLICATIONS OF DEEP LEARNING: SPEECH

Speech recognition and generation (e.g., Google Assistant): A neural
network extracts features from audio data for downstream tasks, e.g., to
classify emotions in speech.

Deep Learning – 12 / 12
Deep Learning

Single Neuron / Perceptron

Learning goals
Graphical representation of a
single neuron
Affine transformations and
non-linear activation functions
Hypothesis spaces of a single
neuron
Typical loss functions
A SINGLE NEURON

Perceptron with input features x1 , x2 , ..., xp , weights w1 , w2 , ..., wp , bias term b, and
activation function τ .

The perceptron is a single artificial neuron and the basic


computational unit of neural networks.

It is a weighted sum of input values, transformed by τ :

f (x ) = τ (w1 x1 + ... + wp xp + b) = τ (wT x + b)

Deep Learning – 1 / 11
A SINGLE NEURON
Activation function τ : a single neuron represents different functions
depending on the choice of activation function.

The identity function gives us simple linear regression:

f(x) = τ(wᵀx) = wᵀx

The logistic function gives us logistic regression:

f(x) = τ(wᵀx) = 1 / (1 + exp(−wᵀx))
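To make this concrete, here is a minimal numpy sketch of a single neuron; the input values and weights below are made up for illustration and are not taken from the slides:

import numpy as np

def neuron(x, w, b, activation):
    # single neuron: affine transformation followed by an activation tau
    return activation(x @ w + b)

identity = lambda z: z                          # linear regression
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic regression

x = np.array([1.0, 2.0, -0.5])   # example input (p = 3)
w = np.array([0.3, -0.1, 0.8])   # example weights
b = 0.2                          # bias term

print(neuron(x, w, b, identity))  # linear score w^T x + b
print(neuron(x, w, b, sigmoid))   # probability in (0, 1)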

Deep Learning – 2 / 11
A SINGLE NEURON
We consider a perceptron with 3-dimensional input, i.e.
f (x) = τ (w1 x1 + w2 x2 + w3 x3 + b).
Input features x are represented by nodes in the “input layer”.

In general, a p-dimensional input vector x will be represented by p


nodes in the input layer.

Deep Learning – 3 / 11
A SINGLE NEURON
Weights w are attached to the edges from the input layer.

The bias term b is implicit here. It is often not visualized as a
separate node.

Deep Learning – 4 / 11
A SINGLE NEURON
For an explicit graphical representation, we do a simple trick:
Add a constant feature to the inputs x̃ = (1, x1 , ..., xp )T
and absorb the bias into the weight vector w̃ = (b, w1 , ..., wp ).
The graphical representation is then:

Deep Learning – 5 / 11
A SINGLE NEURON
The computation τ (w1 x1 + w2 x2 + w3 x3 + b) is represented by the
neuron in the “output layer”.

Deep Learning – 6 / 11
A SINGLE NEURON
You can picture the input vector being "fed" to neurons on the left
followed by a sequence of computations performed from left to
right. This is called a forward pass.

Deep Learning – 7 / 11
A SINGLE NEURON
A neuron performs a 2-step computation:
1 Affine Transformation: weighted sum of inputs plus bias.

2 Non-linear Activation: a non-linear transformation applied to the


weighted sum.

Deep Learning – 8 / 11
A SINGLE NEURON: HYPOTHESIS SPACE
The hypothesis space that is formed by a single neuron is

H = { f : ℝᵖ → ℝ | f(x) = τ( ∑_{j=1}^{p} w_j x_j + b ),  w ∈ ℝᵖ, b ∈ ℝ }.

If τ is the logistic sigmoid or identity function, H corresponds to
the hypothesis space of logistic or linear regression, respectively.

Figure: Left: A regression line learned by a single neuron. Right: A


decision-boundary learned by a single neuron in a binary classification task.

Deep Learning – 9 / 11
A SINGLE NEURON: OPTIMIZATION

To optimize this model, we minimize the empirical risk

R_emp = (1/n) ∑_{i=1}^{n} L(y^(i), f(x^(i))),

where L(y, f(x)) is a loss function. It compares the network's
predictions f(x) to the ground truth y.
For regression, we typically use the L2 loss (rarely L1):

L(y, f(x)) = ½ (y − f(x))²

For binary classification, we typically apply the cross-entropy loss
(also known as Bernoulli loss):

L(y, f(x)) = −(y log f(x) + (1 − y) log(1 − f(x)))
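As a small sketch, both loss functions can be written directly in numpy (the example predictions are made up):

import numpy as np

def l2_loss(y, f):
    # 1/2 (y - f(x))^2
    return 0.5 * (y - f) ** 2

def bernoulli_loss(y, f):
    # -(y log f(x) + (1 - y) log(1 - f(x))), with f in (0, 1)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

print(l2_loss(1.0, 0.8))       # regression-style loss
print(bernoulli_loss(1, 0.8))  # cross-entropy for a positive example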

Deep Learning – 10 / 11
A SINGLE NEURON: OPTIMIZATION
For a single neuron and both choices of τ the loss function is
convex.
The global optimum can be found with an iterative algorithm like
gradient descent.
A single neuron with logistic sigmoid function trained with the
Bernoulli loss yields the same result as logistic regression when
trained until convergence.
Note: In the case of regression and the L2-loss, the solution can
also be found analytically using the “normal equations”. However,
in other cases a closed-form solution is usually not available.

Deep Learning – 11 / 11
Deep Learning

XOR-Problem

Learning goals
Example problem a single
neuron can not solve but a single
hidden layer net can
EXAMPLE: XOR PROBLEM

Suppose we have four data points

X = {(0, 0)ᵀ, (0, 1)ᵀ, (1, 0)ᵀ, (1, 1)ᵀ}

The XOR gate (exclusive or) returns true, when an odd number of
inputs are true:

x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0

Can you learn the target function with a logistic regression model?

Deep Learning – 1 / 10
EXAMPLE: XOR PROBLEM
Logistic regression cannot solve this problem. In fact, any model
using simple hyperplanes for separation cannot (including a single
neuron).

A small neural net can


easily solve the problem by
transforming the space!

Deep Learning – 2 / 10
EXAMPLE: XOR PROBLEM
Consider the following model:

Figure: A neural network with two neurons in the hidden layer. The matrix W
describes the mapping from x to z, the vector u the mapping from z to y.

Deep Learning – 3 / 10
EXAMPLE: XOR PROBLEM
Let us use the ReLU σ(z) = max{0, z} as activation function and a
simple thresholding function

τ(z) = [z > 0] = 1 if z > 0, and 0 otherwise

as output transformation function. We can represent the
architecture of the model by the following equation:

f(x | θ) = f(x | W, b, u, c) = τ(uᵀ σ(Wᵀx + b) + c)
         = τ(uᵀ max{0, Wᵀx + b} + c)

So how many parameters does our model have?

In a fully connected neural net, the number of connections between
the nodes (plus the biases) equals the number of parameters:

(2 × 2) + (2 × 1) + (2 × 1) + 1 = 9
   W         b         u       c

Deep Learning – 4 / 10
EXAMPLE: XOR PROBLEM
     
Let W = [1 1; 1 1],  b = (0, −1)ᵀ,  u = (1, −2)ᵀ,  c = −0.5

    [0 0]        [0 0]            [0 −1]
X = [0 1]   XW = [1 1]   XW + B = [1  0]
    [1 0]        [1 1]            [1  0]
    [1 1]        [2 2]            [2  1]

Note: X is a (n × p) design matrix in which the rows correspond to the data points. W,
as usual, is a (p × m) matrix where each column corresponds to a single (hidden)
neuron. B is a (n × m) matrix with b duplicated along the rows.

Deep Learning – 5 / 10
EXAMPLE: XOR PROBLEM
     
With the same W, b, u, c and X as before, the ReLU activation gives

                       [0 0]
Z = max{0, XW + B}  =  [1 0]
                       [1 0]
                       [2 1]

Note that we computed all examples at once.

Deep Learning – 6 / 10
EXAMPLE: XOR PROBLEM

The input points are mapped into the transformed space:

    [0 0]
Z = [1 0]
    [1 0]
    [2 1]

which is easily separable.

Deep Learning – 8 / 10
EXAMPLE: XOR PROBLEM
In a final step we have to multiply the activated values of matrix Z
with the vector u and add the bias term c:
     
f(x | W, b, u, c) = Z u + c

    [0 0]              [−0.5]   [−0.5]
  = [1 0] [ 1]    +    [−0.5] = [ 0.5]
    [1 0] [−2]         [−0.5]   [ 0.5]
    [2 1]              [−0.5]   [−0.5]

And then apply the step function τ (z ) = [z > 0]. This solves the
XOR problem perfectly!

x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0
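The whole construction can be checked in a few lines of numpy, using exactly the weights W, b, u, c stated above:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

W = np.array([[1.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])
c = -0.5

Z = np.maximum(0.0, X @ W + b)      # hidden layer with ReLU
f = (Z @ u + c > 0).astype(int)     # thresholded output
print(Z)                            # rows: (0,0), (1,0), (1,0), (2,1)
print(f)                            # [0 1 1 0] -- matches the XOR labels
print(np.array_equal(f, y))         # True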

Deep Learning – 9 / 10
NEURAL NETWORKS : OPTIMIZATION

In this simple example we actually “guessed” the values of the


parameters for W , b, u and c.

That won’t work for more sophisticated problems!

We will learn later about iterative optimization algorithms for


automatically adapting weights and biases.

An added complication is that the loss function is no longer
convex. Therefore, a single (global) minimum might not exist.

Deep Learning – 10 / 10
Deep Learning

Single hidden layer neural networks

Learning goals
Architecture of single hidden
layer neural networks
Representation learning/
understanding the advantage of
hidden layers
Typical (non-linear) activation
functions
MOTIVATION

We introduced a graphical way of representing simple functions/models, like
logistic regression. Why is that useful?

Because individual neurons can be used as building blocks of


more complicated functions.

Networks of neurons can represent extremely complex hypothesis


spaces.

Most importantly, it allows us to define the “right” kinds of


hypothesis spaces to learn functions that are common in our
universe in a data-efficient way (see Lin, Tegmark et al. 2016).

Bernd Bischl Deep Learning – 1 / 14


MOTIVATION
Can a single neuron perform binary classification of these points?

Bernd Bischl Deep Learning – 2 / 14


MOTIVATION
As a single neuron is restricted to learning only linear decision
boundaries, its performance on the following task is quite poor:

However, the neuron can easily separate the classes if the original
features are transformed (e.g., from Cartesian to polar
coordinates):

Bernd Bischl Deep Learning – 3 / 14


MOTIVATION
Instead of classifying the data in the original representation,

we classify it in a new feature space.

Bernd Bischl Deep Learning – 4 / 14


MOTIVATION
Analogously, instead of a single neuron,

we use more complex networks.

Bernd Bischl Deep Learning – 5 / 14


REPRESENTATION LEARNING

It is very critical to feed a classifier the “right” features in order for it


to perform well.

Before deep learning took off, features for tasks like machine
vision and speech recognition were “hand-designed” by domain
experts. This step of the machine learning pipeline is called
feature engineering.

DL automates feature engineering. This is called representation


learning.

Bernd Bischl Deep Learning – 6 / 14


SINGLE HIDDEN LAYER NETWORKS
Single neurons perform a 2-step computation:
1 Affine Transformation: a weighted sum of inputs plus bias.
2 Activation: a non-linear transformation on the weighted sum.

Single hidden layer networks consist of two layers (without input


layer):
1 Hidden Layer: having a set of neurons.
2 Output Layer: having one or more output neurons.

Multiple inputs are simultaneously fed to the network.

Each neuron in the hidden layer performs a 2-step computation.

The final output of the network is then calculated by another 2-step


computation performed by the neuron in the output layer.

Bernd Bischl Deep Learning – 7 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Each neuron in the hidden layer performs an affine transformation on
the inputs:

zin^(1) = w11 x^(1) + w21 x^(2) + w31 x^(3) + b1 = 3 · (−3) + (−9) · 1 + 2 · 5 + 5 = −3
zin^(2) = w12 x^(1) + w22 x^(2) + w32 x^(3) + b2 = 11 · (−3) + (−2) · 1 + 7 · 5 + 2 = 2
zin^(3) = w13 x^(1) + w23 x^(2) + w33 x^(3) + b3 = (−6) · (−3) + 3 · 1 + (−4) · 5 − 1 = 0
zin^(4) = w14 x^(1) + w24 x^(2) + w34 x^(3) + b4 = 6 · (−3) + (−1) · 1 + 5 · 5 + 1 = 7

Bernd Bischl Deep Learning – 8 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Each hidden neuron performs a non-linear activation transformation on
the weighted sum:

zout^(i) = σ(zin^(i)) = 1 / (1 + e^(−zin^(i)))

Bernd Bischl Deep Learning – 9 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
The output neuron performs an affine transformation on its inputs:

fin = u1 zout^(1) + u2 zout^(2) + u3 zout^(3) + u4 zout^(4) + c
fin = 3 · 0.05 + (−12) · 0.88 + 8 · 0.50 + 1 · 0.99 + 6 = 0.57

Bernd Bischl Deep Learning – 10 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
The output neuron performs a non-linear activation transformation on
the weighted sum:

fout = σ(fin) = 1 / (1 + e^(−fin))
fout = 1 / (1 + e^(−0.57)) = 0.64

Bernd Bischl Deep Learning – 11 / 14


HIDDEN LAYER: ACTIVATION FUNCTION
If the hidden layer does not have a non-linear activation, the
network can only learn linear decision boundaries.
A lot of different activation functions exist.

Bernd Bischl Deep Learning – 12 / 14


HIDDEN LAYER: ACTIVATION FUNCTION

ReLU Activation:

Currently the most popular choice is the ReLU (rectified linear


unit):
σ(v ) = max(0, v )

Bernd Bischl Deep Learning – 13 / 14


HIDDEN LAYER: ACTIVATION FUNCTION
Sigmoid Activation Function:

The sigmoid function can also be used in the hidden layer:

σ(v) = 1 / (1 + exp(−v))

Bernd Bischl Deep Learning – 14 / 14


Deep Learning

Single Hidden Layer Networks for


Multi-Class Classification

Learning goals
Neural network architectures for
multi-class classification
Softmax activation function
Softmax loss
MULTI-CLASS CLASSIFICATION

We have only considered regression and binary classification


problems so far.

How can we get a neural network to perform multiclass


classification?

Deep Learning – 1 / 6
MULTI-CLASS CLASSIFICATION
The first step is to add additional neurons to the output layer.
Each neuron in the layer will represent a specific class (number of
neurons in the output layer = number of classes).

Figure: Structure of a single hidden layer, feed-forward neural network for


g-class classification problems (bias term omitted).

Deep Learning – 2 / 6
MULTI-CLASS CLASSIFICATION

Notation:

For g-class classification, g output units:

f = (f1 , . . . , fg )

m hidden neurons z1 , . . . , zm , with

zj = σ(Wjᵀ x), j = 1, . . . , m.

Compute linear combinations of derived features z:

fin,k = Ukᵀ z,  z = (z1, . . . , zm)ᵀ,  k = 1, . . . , g

Deep Learning – 3 / 6
MULTI-CLASS CLASSIFICATION

The second step is to apply a softmax activation function to the


output layer.

This gives us a probability distribution over the g possible
classes:

fout,k = τ_k(fin) = exp(fin,k) / ∑_{k'=1}^{g} exp(fin,k')

This is the same transformation used in softmax regression!

Derivative: ∂τ(fin)/∂fin = diag(τ(fin)) − τ(fin) τ(fin)ᵀ

It is a "smooth" approximation of the argmax operation, so
τ((1, 1000, 2)ᵀ) ≈ (0, 1, 0)ᵀ (picks out the 2nd element!).
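A short numpy sketch of the softmax transformation illustrating the "smooth argmax" behaviour; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(f_in):
    # subtract the max for numerical stability; the output is unchanged
    z = np.exp(f_in - np.max(f_in))
    return z / z.sum()

print(softmax(np.array([1.0, 1000.0, 2.0])))  # ~ (0, 1, 0): picks out the 2nd element
print(softmax(np.array([1.0, 2.0, 3.0])))     # a proper probability distribution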

Deep Learning – 4 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
OPTIMIZATION: SOFTMAX LOSS

The loss function for a softmax classifier is

L(y, f(x)) = − ∑_{k=1}^{g} [y = k] log( exp(fin,k) / ∑_{k'=1}^{g} exp(fin,k') ),

where [y = k] = 1 if y = k, and 0 otherwise.

This is equivalent to the cross-entropy loss when the label vector y


is one-hot coded (e.g. y = (0, 0, 1, 0)T ).
Optimization: Again, there is no analytic solution.
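A minimal sketch of this loss for a single observation, assuming the true label is given as a class index rather than a one-hot vector:

import numpy as np

def softmax_loss(y, f_in):
    # cross-entropy loss for class index y and output-layer scores f_in
    z = np.exp(f_in - np.max(f_in))     # numerically stable softmax
    probs = z / z.sum()
    return -np.log(probs[y])

f_in = np.array([2.0, -1.0, 0.5, 0.1])  # example scores for g = 4 classes
print(softmax_loss(2, f_in))            # loss if the true class is k = 3, i.e. y = (0, 0, 1, 0)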

Deep Learning – 6 / 6
Deep Learning

MLP – Multi-Layer Feedforward Neural


Networks

Learning goals
Architectures of deep neural
networks
Deep neural networks as
chained functions
FEEDFORWARD NEURAL NETWORKS

We will now extend the model class once again, such that we allow
an arbitrary number l of hidden layers.

The general term for this model class is (multi-layer) feedforward
networks (inputs are passed through the network from left to right;
no feedback loops are allowed).

Deep Learning – 1 / 7
FEEDFORWARD NEURAL NETWORKS
We can characterize those models by the following chain structure:

f(x) = τ ∘ φ ∘ σ^(l) ∘ φ^(l) ∘ σ^(l−1) ∘ φ^(l−1) ∘ … ∘ σ^(1) ∘ φ^(1)

where σ^(i) and φ^(i) are the activation function and the weighted
sum of hidden layer i, respectively. τ and φ are the corresponding
components of the output layer.

Each hidden layer has:

an associated weight matrix W^(i), bias b^(i), and activations
z^(i) for i ∈ {1, …, l}.
z^(i) = σ^(i)(φ^(i)) = σ^(i)(W^(i)ᵀ z^(i−1) + b^(i)), where z^(0) = x.

Again, without non-linear activations in the hidden layers, the


network can only learn linear decision boundaries.
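A minimal sketch of this chain structure as a forward pass over an arbitrary number of hidden layers; the layer sizes and randomly drawn weights below are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: np.maximum(0.0, z)          # hidden activation (ReLU)
tau = lambda z: 1.0 / (1.0 + np.exp(-z))      # output activation (sigmoid)

def forward(x, weights, biases):
    # chained computation tau(phi(... sigma(phi^(1)(x)) ...))
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma(W.T @ z + b)                # z^(i) = sigma(W^(i)T z^(i-1) + b^(i))
    W, b = weights[-1], biases[-1]
    return tau(W.T @ z + b)                   # output layer

sizes = [3, 4, 4, 1]                          # p = 3 inputs, two hidden layers, one output
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(np.array([1.0, -2.0, 0.5]), weights, biases))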

Deep Learning – 2 / 7
FEEDFORWARD NEURAL NETWORKS

Figure: Structure of a deep neural network with l hidden layers (bias terms
omitted).

Deep Learning – 3 / 7
FEEDFORWARD NEURAL NETWORKS: EXAMPLE

Deep Learning – 4 / 7
WHY ADD MORE LAYERS?

Multiple layers allow for the extraction of more and more abstract
representations.
Each layer in a feed-forward neural network adds its own degree of
non-linearity to the model.

Figure: An intuitive, geometric explanation of the exponential advantage of
deeper networks (Montúfar et al., 2014).

Deep Learning – 5 / 7
DEEP NEURAL NETWORKS
Neural networks today can have hundreds of hidden layers. The greater
the number of layers, the "deeper" the network. Historically DNNs were
very challenging to train and not popular until the late ’00s for several
reasons:

The use of sigmoid activations (e.g., logistic sigmoid and tanh)


significantly slowed down training due to a phenomenon known as
“vanishing gradients”. The introduction of the ReLU activation
largely solved this problem.
Training DNNs on CPUs was too slow to be practical. Switching
over to GPUs cut down training time by more than an order of
magnitude.
When dataset sizes are small, other models (such as SVMs) and
techniques (such as feature engineering) often outperform them.

Deep Learning – 6 / 7
DEEP NEURAL NETWORKS
The availability of large datasets and novel architectures that are
capable of handling even complex tensor-shaped data (e.g. CNNs
for image data), faster hardware, and better optimization and
regularization methods made it feasible to successfully implement
deep neural networks.

An increase in depth often translates to an increase in


performance on a given task. State-of-the-art neural networks,
however, are much more sophisticated than the simple
architectures we have encountered so far.

The term "deep learning" encompasses all of these developments and


refers to the field as a whole.

Deep Learning – 7 / 7
Deep Learning

MLP – Matrix Notation

Learning goals
Compact representation of
neural network equations
Vector notation for neuron layers
Vector and matrix notation of
bias and weight parameters
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS

The input x is a column vector with dimensions p × 1.

W is a weight matrix with dimensions p × m, where m is the
number of hidden neurons:

    [ w1,1  w1,2  ···  w1,m ]
W = [ w2,1  w2,2  ···  w2,m ]
    [  ⋮     ⋮     ⋱    ⋮   ]
    [ wp,1  wp,2  ···  wp,m ]

Deep Learning – 1 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS
Hidden layer:
For example, to obtain z1, we pick the first column of W,

W1 = (w1,1, w2,1, …, wp,1)ᵀ,

and compute

z1 = σ(W1ᵀ x + b1),

where b1 is the bias of the first hidden neuron and σ : ℝ → ℝ is
an activation function.

Deep Learning – 2 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION

The network has m hidden neurons z1, …, zm with

zj = σ(Wjᵀ x + bj)
zin,j = Wjᵀ x + bj
zout,j = σ(zin,j) = σ(Wjᵀ x + bj)

for j ∈ {1, …, m}.

Vectorized notation:
zin = (zin,1, …, zin,m)ᵀ = Wᵀx + b
(Note: Wᵀx = (xᵀW)ᵀ)
z = zout = σ(zin) = σ(Wᵀx + b), where the (hidden layer)
activation function σ is applied element-wise to zin.

Deep Learning – 3 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Bias term:
We sometimes omit the bias term by adding a constant
feature to the input, x̃ = (1, x1, ..., xp)ᵀ, and by absorbing the
bias vector b as an additional first row of the weight matrix W̃.

Note: For simplification purposes, we will not explicitly


represent the bias term graphically in the following. However,
the above “trick” makes it straightforward to represent it
graphically.

Deep Learning – 4 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Output layer:

For regression or binary classification: one output unit f where


fin = uT z + c , i.e. a linear combination of derived features
plus the bias term c of the output neuron, and
f (x) = fout = τ (fin ) = τ (uT z + c ) , where τ is the output
activation function.
For regression τ is the identity function.
For binary classification, τ is a sigmoid function.
Note: The purpose of the hidden-layer activation function σ is to
introduce non-linearities so that the network is able to learn
complex functions whereas the purpose of τ is merely to get the
final score to the same range as the target.

Deep Learning – 5 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Multiple inputs:
It is possible to feed multiple inputs to a neural network
simultaneously.

The inputs x(i ) , for i ∈ {1, . . . , n}, are arranged as rows in the
design matrix X.
X is a (n × p)-matrix.

The weighted sum in the hidden layer is now computed as


XW + B, where,
W, as usual, is a (p × m) matrix, and,
B is a (n × m) matrix containing the bias vector b (duplicated)
as the rows of the matrix.

The matrix of hidden activations Z = σ(XW + B )


Z is a (n × m) matrix.

Deep Learning – 6 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
The final output of the network, which contains a prediction for
each input, is τ (Z u + C ), where

u is the vector of weights of the output neuron, and,


C is a (n × 1) matrix whose elements are the (scalar) bias c
of the output neuron.
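As a sketch, the whole batched forward pass reduces to two matrix products; shapes follow the notation above, and all numbers are randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(1)
n, p, m = 5, 3, 4                    # n observations, p features, m hidden neurons

X = rng.normal(size=(n, p))          # design matrix, rows = data points
W = rng.normal(size=(p, m))          # hidden-layer weights
b = rng.normal(size=m)               # hidden-layer biases (broadcast over rows, i.e. B)
u = rng.normal(size=m)               # output weights
c = 0.1                              # output bias (broadcast, i.e. C)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

Z = sigma(X @ W + b)                 # (n x m) matrix of hidden activations
f = sigma(Z @ u + c)                 # one prediction per row of X
print(f.shape)                       # (5,)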

Deep Learning – 7 / 9
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Weights (and biases) of the network.

Deep Learning – 8 / 9
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Forward pass through the shallow neural network.

Deep Learning – 9 / 9
Deep Learning

Universal Approximation

Learning goals
Universal approximation theorem for one-hidden-layer neural networks
The pros and cons of a low approximation error

Figure: nnet fit with size=4, maxit=1e+03 (Train: mse=0.029; CV: mse.test.mean=0.055).
UNIVERSAL APPROXIMATION PROPERTY
Theorem. Let σ : R → R be a continuous, non-constant, bounded,
and monotonically increasing function. Let C ⊂ Rp be compact, and let
C(C ) denote the space of continuous functions C → R. Then, given a
function g ∈ C(C ) and an accuracy ε > 0, there exists a hidden layer
size m ∈ N and a set of coefficients Wj ∈ Rp , uj , bj ∈ R (for
j ∈ {1, . . . , m}), such that
f : C → ℝ;  f(x) = ∑_{j=1}^{m} uj · σ(Wjᵀ x + bj)

is an ε-approximation of g, that is,

‖f − g‖_∞ := max_{x ∈ C} |f(x) − g(x)| < ε.

The theorem extends trivially to multiple outputs.
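The function class in the theorem is easy to write down explicitly; the following sketch only illustrates the form ∑ uj · σ(Wjᵀx + bj) with arbitrary (random) coefficients, it does not construct an actual ε-approximation:

import numpy as np

def f(x, W, u, b, sigma=lambda v: 1.0 / (1.0 + np.exp(-v))):
    # one-hidden-layer network with linear output: sum_j u_j * sigma(W_j^T x + b_j)
    return np.sum(u * sigma(W @ x + b))

m, p = 50, 2                                   # hidden layer size and input dimension
rng = np.random.default_rng(42)
W, u, b = rng.normal(size=(m, p)), rng.normal(size=m), rng.normal(size=m)
print(f(np.array([0.3, -1.2]), W, u, b))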

Deep Learning – 1 / 14
UNIVERSAL APPROXIMATION PROPERTY
Corollary. Neural networks with a single sigmoidal hidden layer and
linear output layer are universal approximators.
This means that for a given target function g there exists a
sequence of networks (f_k)_{k∈ℕ} that converges (pointwise) to g.

Usually, as the networks come closer and closer to g, they will


need more and more hidden neurons.

A network with fixed layer sizes can only model a subspace of all
continuous functions.

The continuous functions form an infinite dimensional vector


space. Therefore arbitrarily large hidden layer sizes are needed.

Deep Learning – 2 / 14
UNIVERSAL APPROXIMATION PROPERTY
Why is universal approximation a desirable property?

Recall the definition of a Bayes optimal hypothesis f ∗ : X → Y . It


is the best possible hypothesis (model) for the given problem: it
has minimal loss averaged over the data generating distribution.

So ideally we would like the neural network (or any other learner)
to approximate the Bayes optimal hypothesis.

Usually we do not manage to learn f ∗ .

This is because we do not have enough (infinite) data. We have no


control over this, so we have to live with this limitation.

But we do have control over which model class we use.

Deep Learning – 3 / 14
UNIVERSAL APPROXIMATION PROPERTY
Universal approximation ⇒ approximation error tends to zero as
hidden layer size tends to infinity.

Positive approximation error implies that no matter how big the


data set, we cannot find the optimal model.

This bears the risk of systematic under-fitting, which can be


avoided with a universal model class.

Deep Learning – 4 / 14
UNIVERSAL APPROXIMATION PROPERTY
As we know, there are also good reasons for restricting the model
class.

This is because a flexible model class with universal approximation


ability often results in over-fitting, which is no better than
under-fitting.

Thus, “universal approximation ⇒ low approximation error”, but at


the risk of a substantial learning error.

In general, models of intermediate flexibility give the best


predictions. For neural networks this amounts to a reasonably
sized hidden layer.

Deep Learning – 5 / 14
EXAMPLE : REGRESSION/CLASSIFICATION

Let’s look at a few examples of the types of functions and


decisions boundaries learnt by neural networks (with a single
hidden layer) of various sizes.

"size" here refers to the number of neurons in the hidden layer.

The number of "iterations" in the following slides corresponds to


the number of steps of the applied iterative optimization algorithm
(stochastic gradient descent).

Deep Learning – 6 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=1; maxit=1e+03
Train: mse=0.391; CV: mse.test.mean=0.419





Deep Learning – 7 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=2; maxit=1e+03
Train: mse=0.088; CV: mse.test.mean=0.112





Deep Learning – 8 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=3; maxit=1e+03
Train: mse=0.032; CV: mse.test.mean=0.063





Deep Learning – 9 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=4; maxit=1e+03
Train: mse=0.029; CV: mse.test.mean=0.055





Deep Learning – 10 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=5; maxit=1e+03
Train: mse=0.028; CV: mse.test.mean=19.845





Deep Learning – 11 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=6; maxit=1e+03
Train: mse=0.031; CV: mse.test.mean=4.374





Deep Learning – 12 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=10; maxit=1e+03
Train: mse=0.023; CV: mse.test.mean=0.698





Deep Learning – 13 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=1; maxit=500
Train: mmce=0.336; CV: mmce.test.mean=0.346



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=2; maxit=500
Train: mmce=0.426; CV: mmce.test.mean=0.412



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=3; maxit=500
Train: mmce=0.290; CV: mmce.test.mean=0.374



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=5; maxit=500
Train: mmce=0.272; CV: mmce.test.mean=0.322



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=10; maxit=500
Train: mmce=0.184; CV: mmce.test.mean=0.106



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=30; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.034



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=50; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.026



Deep Learning – 14 / 14
Deep Learning

Brief History

Learning goals
Predecessors of modern (deep)
neural networks
History of DL as a field
A BRIEF HISTORY OF NEURAL NETWORKS

1943: The first artificial neuron, the "Threshold Logic Unit (TLU)",
was proposed by Warren McCulloch & Walter Pitts.

The model is limited to binary inputs.


It fires/outputs +1 if the input exceeds a certain threshold θ.
The weights are not adjustable, so learning could only be
achieved by changing the threshold θ.

Deep Learning – 1 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1957: The perceptron was invented by Frank Rosenblatt.

The inputs are not restricted to be binary.


The weights are adjustable and can be learned by learning
algorithms.
As for the TLU, the threshold is adjustable and decision
boundaries are linear.

Deep Learning – 2 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1960: Adaptive Linear Neuron (ADALINE) was invented by
Bernard Widrow & Ted Hoff; weights are now adjustable according
to the weighted sum of the inputs.

1965: Group method of data handling (also known as polynomial


neural networks) by Alexey Ivakhnenko. The first learning
algorithm for supervised deep feedforward multilayer perceptrons.

Deep Learning – 3 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1969: The first “AI Winter” kicked in.
Marvin Minsky & Seymour Papert proved that a perceptron
cannot solve the XOR-Problem (linear separability).
Less funding ⇒ Standstill in AI/DL research.

1985: Multilayer perceptron with backpropagation by David


Rumelhart, Geoffrey Hinton, and Ronald Williams.
Efficiently compute derivatives of composite functions.
Backpropagation was developed already in 1970 by
Linnainmaa.
Deep Learning – 4 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1985: The second “AI Winter” kicked in.
Overly optimistic expectations concerning potential of AI/DL.
The phrase “AI” had even reached pseudoscience status.
Kernel machines and graphical models both achieved good results on
many important tasks.
Some fundamental mathematical difficulties in modeling long sequences
were identified.

Credit: https://emerj.com/ai-executive-guides/will-there-be-another-artificial-intelligence-winter-probably-not/

Deep Learning – 5 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
2006: Age of deep neural networks began.
Geoffrey Hinton showed that a deep belief network could be efficiently
trained using greedy layer-wise pretraining.
This wave of research popularized the use of the term deep learning to
emphasize that researchers were now able to train deeper neural networks
than had been possible before.
At this time, deep neural networks outperformed competing AI systems
based on other ML technologies as well as hand-designed functionality.

Deep Learning – 6 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: https://towardsdatascience.com/a-weird-introduction-to-deep-learning-7828803693b0

Deep Learning – 7 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Deep Learning – 8 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Figure: IBM Supercomputer

Watson is a question-answering system capable of answering questions posed in


natural language, developed in IBM’s DeepQA project.

In 2011, Watson competed on Jeopardy! against champions Brad Rutter and


Ken Jennings, winning the first place prize of $1 million.

Deep Learning – 9 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Figure: Google self driving car (Waymo)

Google’s development of self-driving technology began on January 17, 2009, at


the company’s secretive X lab.

By January 2020, 20 million miles of self-driving on public roads had been


completed by Waymo.

Deep Learning – 10 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: DeepMind

AlphaFold is a deep learning system, developed by Google DeepMind, for


determining a protein’s 3D shape from its amino-acid sequence.

In 2018 and 2020, AlphaFold placed first in the overall rankings of the Critical
Assessment of Techniques for Protein Structure Prediction (CASP).

Deep Learning – 11 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: DeepMind

AlphaGo, originally developed by DeepMind, is a deep learning system that


plays the board game Go. In 2017, the Master version of AlphaGo beat Ke Jie,
the number one ranked player in the world at the time.

While there are several extensions to AlphaGo (e.g., Master AlphaGo, AlphaGo
Zero, AlphaZero, and MuZero), the main idea is the same: search for optimal
moves based on knowledge acquired by machine learning.

Deep Learning – 12 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Generative Pre-trained Transformer 3 (GPT-3) is the third generation of the
GPT model, introduced by OpenAI in May 2020, to produce human-like text.

There are 175 billion parameters to be learned by the algorithm, but the quality of
the generated text is so high that it is hardly possible to distinguish it from a
human-written text.

Deep Learning – 13 / 13
Deep Learning

Basic Training

Learning goals
Empirical risk minimization
Gradient descent
Stochastic gradient descent
TRAINING NEURAL NETWORKS
In ML we use empirical risk minimization (ERM) to minimize
prediction losses over the training data

R_emp = (1/n) ∑_{i=1}^{n} L(y^(i), f(x^(i) | θ))

In DL, θ represents the weights (and biases) of the NN.

Often, L2-loss in regression:

L(y, f(x)) = ½ (y − f(x))²

or cross-entropy for binary classification:

L(y, f(x)) = −(y log f(x) + (1 − y) log(1 − f(x)))

ERM can be implemented by gradient descent (GD).

Deep Learning – 1 / 11
GRADIENT DESCENT
The negative risk gradient points in the direction of the steepest descent:

−g = −∇R_emp(θ) = −(∂R_emp/∂θ1, …, ∂R_emp/∂θd)ᵀ

"Standing" at a point θ[t], we locally improve by:

θ[t+1] = θ[t] − α g,

where α is called the step size or learning rate.

Figure: Contour plot of R_emp over (θ1, θ2) with the gradient descent path.
Deep Learning – 2 / 11
GRADIENT DESCENT AND OPTIMALITY
GD is a greedy algorithm: In
every iteration, it makes
locally optimal moves.

If Remp (θ) is convex and


differentiable, and its
gradient is Lipschitz
continuous, GD is guaranteed
to converge to the global
minimum (for small enough
step-size).

However, if Remp (θ) has


multiple local optima and/or
saddle points, GD might only
converge to a stationary point
(other than the global
optimum), depending on the
starting point.

Deep Learning – 3 / 11
GRADIENT DESCENT AND OPTIMALITY
Note: It might not be that bad if we do not find the global optimum:
We don't optimize the (theoretical) risk, but only an approximate
version, i.e. the empirical risk.
For very flexible models, aggressive optimization might lead to overfitting.
Early stopping might even increase generalization performance.

Deep Learning – 4 / 11
LEARNING RATE (LR)
The step-size α plays a key role in the convergence of the algorithm.

If the step size is too small, the training process may converge very
slowly (see left image). If the step size is too large, the process may not
converge, because it jumps around the optimal point (see right image).

Deep Learning – 5 / 11
LEARNING RATE
So far we have assumed a fixed value of α in every iteration:

α[t ] = α ∀t = {1, . . . , T }
However, it makes sense to adapt α in every iteration:

Steps of gradient descent for R_emp(θ) = 10 θ1² + 0.5 θ2². Left: 100 steps with a fixed
learning rate. Right: 40 steps with an adaptive learning rate.

Deep Learning – 6 / 11
WEIGHT INITIALIZATION
Weights (and biases) of an NN must be initialized in GD.
We somehow must "break symmetry" – which would happen in
full-0-initialization. If two neurons (with the same activation) are
connected to the same inputs and have the same initial weights,
then both neurons will have the same gradient update and learn
the same features.
Weights are typically drawn from a uniform or a Gaussian distribution
(both centered at 0 with a small variance).
Two common initialization strategies are ’Glorot initialization’ and
’He initialization’ which tune the variance of these distributions
based on the topology of the network.

Deep Learning – 7 / 11
STOCHASTIC GRADIENT DESCENT (SGD)
GD for ERM was:

θ[t+1] = θ[t] − α · (1/n) ∑_{i=1}^{n} ∇_θ L(y^(i), f(x^(i) | θ[t]))

Using the entire training set in GD is called batch or
deterministic or offline training. This can be computationally
costly or impossible, if the data does not fit into memory.
Idea: Instead of letting the sum run over the whole dataset, use
small stochastic subsets (minibatches), or only a single x^(i).
If batches are uniformly sampled from Dtrain, our stochastic
gradient is in expectation the batch gradient ∇_θ R_emp(θ).
The gradient w.r.t. a single x is fast to compute but not reliable. But
it is often used in a theoretical analysis of SGD.
→ We have a stochastic, noisy version of the batch GD.

Deep Learning – 8 / 11
STOCHASTIC GRADIENT DESCENT (SGD)
SGD on function 1.25(x1 + 6)2 + (x2 − 8)2 .

Source : Shalev-Shwartz and Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University
Press, 2014.

Figure: Left = GD, right = SGD. Black line is an average of different SGD runs.

Deep Learning – 9 / 11
STOCHASTIC GRADIENT DESCENT

Algorithm: Basic SGD pseudo code

1: Initialize parameter vector θ[0]
2: t ← 0
3: while stopping criterion not met do
4:    Randomly shuffle data and partition into minibatches J1, ..., JK of size m
5:    for k ∈ {1, ..., K} do
6:       t ← t + 1
7:       Compute gradient estimate with Jk: ĝ[t] ← (1/m) ∑_{i ∈ Jk} ∇_θ L(y^(i), f(x^(i) | θ[t−1]))
8:       Apply update: θ[t] ← θ[t−1] − α ĝ[t]
9:    end for
10: end while

A full SGD pass over data is an epoch.


Minibatch sizes are typically between 50 and 1000.
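The pseudo code translates almost line by line into numpy; the sketch below uses a made-up linear regression problem with the L2 loss, so the gradient in step 7 has a closed form:

import numpy as np

rng = np.random.default_rng(0)
n, p, m, alpha = 1000, 5, 50, 0.1            # data size, features, minibatch size, LR
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(p)                           # step 1: initialize theta
for epoch in range(20):                       # "while stopping criterion not met"
    idx = rng.permutation(n)                  # step 4: shuffle ...
    for batch in idx.reshape(-1, m):          # ... and partition (assumes n divisible by m)
        Xb, yb = X[batch], y[batch]
        residual = Xb @ theta - yb            # f(x | theta) - y for the L2 loss
        g_hat = Xb.T @ residual / m           # step 7: minibatch gradient estimate
        theta = theta - alpha * g_hat         # step 8: update
print(np.round(theta - theta_true, 3))        # close to zero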

Deep Learning – 10 / 11
STOCHASTIC GRADIENT DESCENT
SGD is the most used optimizer in ML and especially in DL.
We usually have to add a considerable amount of tricks to SGD to
make it really efficient (e.g. momentum). More on this later.
SGD with (small) batches has a high variance, although it is
unbiased. Hence, the LR α is smaller than in the batch mode.
When the LR is slowly decreased, SGD converges to a local minimum.
Recent results indicate that SGD often leads to better
generalization than GD, and may result in indirect regularization.

Deep Learning – 11 / 11
Deep Learning

Chain Rule and Computational Graphs

Learning goals
Chain rule of calculus
Computational graphs
CHAIN RULE OF CALCULUS
The chain rule can be used to compute derivatives of the
composition of two or more functions.
Let x ∈ ℝᵐ, y ∈ ℝⁿ, g : ℝᵐ → ℝⁿ and f : ℝⁿ → ℝ.
If y = g(x) and z = f(y), the chain rule yields:

∂z/∂x_i = ∑_j (∂z/∂y_j) · (∂y_j/∂x_i)

or, in vector notation:

∇_x z = (∂y/∂x)ᵀ ∇_y z,

where ∂y/∂x is the (n × m) Jacobian matrix of g.

Deep Learning – 2 / ??
COMPUTATIONAL GRAPHS

CGs are nested expressions,


visualized as graphs.
Each node is a variable, either
an input or derived.
Derived variables are functions
applied to other variables.
source : Goodfellow et al. (2016)

Figure: The computational graph for


the expression H = σ(XW + B ) with
activation function σ(·).

Deep Learning – 3 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 1

Suppose we have the following


computational graph.
To compute the derivative of
∂z
∂ w we need to recursively
apply the chain rule. That is:

∂z ∂z ∂y ∂x
= · ·
∂w ∂y ∂x ∂w
= f3′ (y ) · f2′ (x ) · f1′ (w )
source : Goodfellow et al. (2016)
= f3′ (f2 (f1 (w ))) · f2′ (f1 (w )) · f1′ (w )
Figure: A computational
graph, such that
x = f1 (w ), y = f2 (x )
and z = f3 (y ).
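A small numeric sketch of this recursion, with arbitrarily chosen functions f1, f2, f3 and a finite-difference check of the resulting derivative:

import numpy as np

f1, d_f1 = np.sin, np.cos                        # x = f1(w)
f2, d_f2 = np.exp, np.exp                        # y = f2(x)
f3, d_f3 = lambda y: y ** 2, lambda y: 2 * y     # z = f3(y)

def z(w):
    return f3(f2(f1(w)))

w = 0.7
chain = d_f3(f2(f1(w))) * d_f2(f1(w)) * d_f1(w)      # f3'(f2(f1(w))) * f2'(f1(w)) * f1'(w)
eps = 1e-6
finite_diff = (z(w + eps) - z(w - eps)) / (2 * eps)  # numeric derivative for comparison
print(chain, finite_diff)                            # the two values agree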

Deep Learning – 4 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 2

To compute ∇_x z, we apply the chain rule:

∂z/∂x1 = ∑_j (∂z/∂y_j)(∂y_j/∂x1) = (∂z/∂y1)(∂y1/∂x1) + (∂z/∂y2)(∂y2/∂x1)
∂z/∂x2 = ∑_j (∂z/∂y_j)(∂y_j/∂x2) = (∂z/∂y1)(∂y1/∂x2) + (∂z/∂y2)(∂y2/∂x2)

Therefore, the gradient of z w.r.t. x is

∇_x z = [∂z/∂x1; ∂z/∂x2] = [∂y1/∂x1  ∂y2/∂x1; ∂y1/∂x2  ∂y2/∂x2] [∂z/∂y1; ∂z/∂y2] = (∂y/∂x)ᵀ ∇_y z

Deep Learning – 5 / ??
COMPUTATIONAL GRAPH: NEURAL NET

Figure: A neural network can be seen as a computational graph. ϕ is the


weighted sum and σ and τ are the activations.
Note: In contrast to the top figure, the arrows in the computational graph below
merely indicate dependence, not weights.

Deep Learning – 6 / ??
Deep Learning

Basic Backpropagation 1

Learning goals
Forward and backward passes
Chain rule
Details of backprop
BACKPROPAGATION: BASIC IDEA
We would like to optimize ERM using gradient descent (GD) on:
n
1 X  (i )  (i ) 
Remp (θ) = L y ,f x | θ .
n
i =1

Backprop training of NNs runs in 2 alternating steps, for one x:


1 Forward pass (FP): Inputs flow through model to outputs. We
then compute the observation loss (see previous chapters).
2 Backward pass (BP): Loss flows backwards to update weights so
error is reduced, as in GD.

We will see: This is simply (S)GD in disguise, cleverly using the chain
rule, so we can reuse a lot of intermediate results.

Bernd Bischl Deep Learning – 1 / 19


XOR EXAMPLE
As activations (hidden and outputs) we use the sigmoid function.
We run one FP and BP on x = (1, 0)T with y = 1.
We use L2 loss between 0-1 labels and the predicted probabilities.
This is a bit uncommon, but computations become simpler.

Note: We will only show rounded decimals.

Bernd Bischl Deep Learning – 2 / 19


FORWARD PASS
We will divide the FP into four steps:
the inputs of zi : zi ,in
the activations of zi : zi ,out
the input of f : fin
and finally the activation of f : fout

Bernd Bischl Deep Learning – 3 / 19


FORWARD PASS

z1,in = W1ᵀ x + b1 = 1 · (−0.07) + 0 · 0.22 + 1 · (−0.46) = −0.53
z1,out = σ(z1,in) = 1 / (1 + exp(−(−0.53))) = 0.3705
Bernd Bischl Deep Learning – 4 / 19


FORWARD PASS

z2,in = W2ᵀ x + b2 = 1 · 0.94 + 0 · 0.46 + 1 · 0.1 = 1.04
z2,out = σ(z2,in) = 1 / (1 + exp(−1.04)) = 0.7389

Bernd Bischl Deep Learning – 5 / 19


FORWARD PASS

fin = uᵀ z + c = 0.3705 · (−0.22) + 0.7389 · 0.58 + 1 · 0.78 = 1.1122
fout = τ(fin) = 1 / (1 + exp(−1.1122)) = 0.7525

Bernd Bischl Deep Learning – 6 / 19


FORWARD PASS
The FP predicted fout = 0.7525.
Now, we compare the prediction fout = 0.7525 and the true label
y = 1 using the L2-loss:

L(y, f(x)) = ½ (y − f(x^(i) | θ))² = ½ (y − fout)²
           = ½ (1 − 0.7525)² = 0.0306
The calculation of the gradient is performed backwards (starting
from the output layer), so that results can be reused.
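The forward pass can be reproduced with a few lines of numpy, using the weights shown in the figure; since the slides round intermediate results, the printed values may differ slightly in the last digits:

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x, y = np.array([1.0, 0.0]), 1.0
W = np.array([[-0.07, 0.94],
              [ 0.22, 0.46]])      # columns correspond to the hidden neurons z1, z2
b = np.array([-0.46, 0.10])
u = np.array([-0.22, 0.58])
c = 0.78

z_in = W.T @ x + b                 # (-0.53, 1.04)
z_out = sigmoid(z_in)              # (0.3705, 0.7389)
f_in = u @ z_out + c
f_out = sigmoid(f_in)              # approx. 0.75
loss = 0.5 * (y - f_out) ** 2      # approx. 0.03
print(z_out, f_out, loss)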

Bernd Bischl Deep Learning – 7 / 19


BACKWARD PASS
The main ingredients of the backward pass are:
to reuse the results of the forward pass
(here: zi ,in , zi ,out , fin , fout )
reuse the intermediate results from the chain rule
the derivative of the activations and some affine functions

Bernd Bischl Deep Learning – 8 / 19


BACKWARD PASS
Let’s start to update u1 . We recursively apply the chain rule:

∂L(y, f(x))/∂u1 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂u1)

Figure: Snippet from our NN, with backward path for u1 .

Bernd Bischl Deep Learning – 9 / 19


BACKWARD PASS
1st step: The derivative of L2 loss is easy; we know fout from FP.

∂L(y, f(x))/∂fout = ∂/∂fout [ ½ (y − fout)² ] = −(y − fout)    (where y − fout is the residual)
                  = −(1 − 0.7525) = −0.2475

Bernd Bischl Deep Learning – 10 / 19


BACKWARD PASS
2nd step. fout = σ(fin ), use rule for σ ′ , use fin from FP.

∂fout/∂fin = σ(fin) · (1 − σ(fin)) = 0.7525 · (1 − 0.7525) = 0.1862

Bernd Bischl Deep Learning – 11 / 19


BACKWARD PASS
3rd step. Derivative of the linear input is easy; use z1,out from FP.

∂fin/∂u1 = ∂(u1 · z1,out + u2 · z2,out + c · 1)/∂u1 = z1,out = 0.3705

Bernd Bischl Deep Learning – 12 / 19


BACKWARD PASS
Plug it together:

∂L(y, f(x))/∂u1 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂u1)
                = −0.2475 · 0.1862 · 0.3705 = −0.0171

With LR α = 0.5:

u1[new] = u1[old] − α · ∂L(y, f(x))/∂u1
        = −0.22 − 0.5 · (−0.0171) = −0.2115
Bernd Bischl Deep Learning – 13 / 19


BACKWARD PASS
Now for W11 :

∂L(y, f(x))/∂W11 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂z1,out) · (∂z1,out/∂z1,in) · (∂z1,in/∂W11)

We know ∂L(y, f(x))/∂fout and ∂fout/∂fin from the BP for u1.

Bernd Bischl Deep Learning – 14 / 19


BACKWARD PASS
fin = u1 · z1,out + u2 · z2,out + c · 1 is linear, easy and we know u1 :

∂fin/∂z1,out = u1 = −0.22

Bernd Bischl Deep Learning – 15 / 19


BACKWARD PASS
Next. Use rule for σ ′ and FP results:

∂z1,out/∂z1,in = σ(z1,in) · (1 − σ(z1,in)) = 0.3705 · (1 − 0.3705) = 0.2332

Bernd Bischl Deep Learning – 16 / 19


BACKWARD PASS
z1,in = x1 · W11 + x2 · W21 + b1 · 1 is linear and depends on inputs:

∂z1,in/∂W11 = x1 = 1

Bernd Bischl Deep Learning – 17 / 19


BACKWARD PASS
Plugging together:
∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11
                 = (−0.2475) · 0.1862 · (−0.22) · 0.2332 · 1 = 0.0024

Full SGD update:

W11^[new] = W11^[old] − α · ∂L(y, f(x))/∂W11
          = −0.07 − 0.5 · 0.0024 = −0.0712

Bernd Bischl Deep Learning – 18 / 19


RESULT
We can do this for all weights:
   
W = ( −0.0712   0.9426 )        b = ( −0.4612 )
    (  0.22     0.46   )            (  0.1026 )

u = ( −0.2115 )        and  c = 0.8030.
    (  0.5970 )

Yields f(x | θ^[new]) = 0.7615 and loss ½ (1 − 0.7615)² = 0.0284.


Before, we had f (x | θ [old ] ) = 0.7525 and higher loss 0.0306.

Now rinse and repeat. This was one training iter, we do thousands.
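For completeness, a compact NumPy sketch of one full training iteration (forward pass, backward pass via the chain rule, SGD update with α = 0.5); it only illustrates the computations above, with variable names chosen by us:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# weights of the XOR example, one observation and the learning rate
W = np.array([[-0.07, 0.94], [0.22, 0.46]]); b = np.array([-0.46, 0.10])
u = np.array([-0.22, 0.58]); c = 0.78
x = np.array([1.0, 0.0]); y = 1.0; alpha = 0.5

# forward pass (all intermediate results are cached for the backward pass)
z_in = W.T @ x + b
z_out = sigmoid(z_in)
f_in = u @ z_out + c
f_out = sigmoid(f_in)

# backward pass: error signal of the output unit, then of the hidden units
delta_out = -(y - f_out) * f_out * (1 - f_out)      # dL/df_in
delta_hid = u * delta_out * z_out * (1 - z_out)     # dL/dz_in, one entry per hidden unit

# gradients w.r.t. all parameters
grad_u, grad_c = delta_out * z_out, delta_out
grad_W, grad_b = np.outer(x, delta_hid), delta_hid

# SGD update
u, c = u - alpha * grad_u, c - alpha * grad_c
W, b = W - alpha * grad_W, b - alpha * grad_b
print(np.round(W, 4), np.round(b, 4), np.round(u, 4), round(c, 4))  # close to the values above
```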

Bernd Bischl Deep Learning – 19 / 19


Deep Learning

Basic Backpropagation 2

Learning goals
Backprop formalism and
recursion
BACKWARD COMPUTATION AND CACHING
In the XOR example, we computed:

∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11

Deep Learning – 1 / 9
BACKWARD COMPUTATION AND CACHING
Next, let us compute:

∂L(y, f(x))/∂W21 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W21

Deep Learning – 2 / 9
BACKWARD COMPUTATION AND CACHING
Examining the two expressions:
∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11

∂L(y, f(x))/∂W21 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W21
Significant overlap / redundancy in the two expressions: the first four factors are identical.
Again: Let’s call this common subexpression δ1 and cache it.

δ1 = ∂L(y, f(x))/∂z1,in = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in

∂L(y, f(x))/∂W11 = δ1 · ∂z1,in/∂W11    and    ∂L(y, f(x))/∂W21 = δ1 · ∂z1,in/∂W21
δ1 can also be seen as an error signal that represents how much
the loss L changes when the input z1,in changes.
Deep Learning – 3 / 9
BACKPROP: RECURSION
Let us now derive a general formulation of backprop.
The neurons in layers i − 1, i and i + 1 are indexed by j, k and m,
respectively.
The output layer will be referred to as layer O.

Credit: Erik Hallström

Deep Learning – 4 / 9
BACKPROP: RECURSION

Let δ^(i)_k̃ (also: error signal) for a neuron k̃ in layer i represent how much
the loss L changes when the input z^(i)_k̃,in changes:

δ^(i)_k̃ = ∂L / ∂z^(i)_k̃,in
        = ∂L/∂z^(i)_k̃,out · ∂z^(i)_k̃,out/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂z^(i+1)_m,in/∂z^(i)_k̃,out ) · ∂z^(i)_k̃,out/∂z^(i)_k̃,in

Note: The sum in the expression above is over all the neurons in layer
i + 1. This is simply an application of the (multivariate) chain rule.

Deep Learning – 5 / 9
BACKPROP: RECURSION
Using

z^(i)_k̃,out = σ(z^(i)_k̃,in)
z^(i+1)_m,in = Σ_k W^(i+1)_k,m z^(i)_k,out + b^(i+1)_m

we get:

δ^(i)_k̃ = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂z^(i+1)_m,in/∂z^(i)_k̃,out ) · ∂z^(i)_k̃,out/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂( Σ_k W^(i+1)_k,m z^(i)_k,out + b^(i+1)_m )/∂z^(i)_k̃,out ) · ∂σ(z^(i)_k̃,in)/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · W^(i+1)_k̃,m ) · σ′(z^(i)_k̃,in)
        = ( Σ_m δ^(i+1)_m W^(i+1)_k̃,m ) · σ′(z^(i)_k̃,in)

Therefore, we now have a recursive definition for the error signal of a
neuron in layer i in terms of the error signals of the neurons in layer
i + 1 and, by extension, layers {i+2, i+3, . . . , O}!

Deep Learning – 6 / 9
BACKPROP: RECURSION
Given the error signal δ^(i)_k̃ of neuron k̃ in layer i, the derivative of the
loss L w.r.t. the weight W^(i)_j̃,k̃ is simply:

∂L/∂W^(i)_j̃,k̃ = ∂L/∂z^(i)_k̃,in · ∂z^(i)_k̃,in/∂W^(i)_j̃,k̃ = δ^(i)_k̃ · z^(i−1)_j̃,out

because z^(i)_k̃,in = Σ_j W^(i)_j,k̃ z^(i−1)_j,out + b^(i)_k̃.

Similarly, the derivative of the loss L w.r.t. the bias b^(i)_k̃ is:

∂L/∂b^(i)_k̃ = ∂L/∂z^(i)_k̃,in · ∂z^(i)_k̃,in/∂b^(i)_k̃ = δ^(i)_k̃

Deep Learning – 7 / 9
BACKPROP: RECURSION
It is not hard to show that the error signal δ^(i) for an entire layer i is
(⊙ = element-wise product):

δ^(O) = ∇_fout L ⊙ τ′(fin)
δ^(i) = (W^(i+1) δ^(i+1)) ⊙ σ′(z^(i)_in)
Therefore, backpropagation works by computing and storing the
error signals backwards. That is, starting at the output layer and
ending at the first hidden layer. This way, the error signals of later
layers propagate backwards to the earlier layers.
The derivative of the loss L w.r.t. a given weight is computed
efficiently by plugging in the cached error signals, thereby avoiding
expensive and redundant computations.
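A minimal NumPy sketch of this recursion for a fully connected network with sigmoid activations and L2 loss; the layer sizes, variable names and random weights are our own illustration choices:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, Ws, bs):
    """One forward/backward pass; Ws[i] maps layer i to layer i+1 (columns = units of layer i+1)."""
    # forward pass: cache inputs and activations of every layer
    z_outs, z_ins = [x], []
    for W, b in zip(Ws, bs):
        z_ins.append(W.T @ z_outs[-1] + b)
        z_outs.append(sigmoid(z_ins[-1]))
    f_out = z_outs[-1]

    # output-layer error signal for L2 loss with sigmoid output activation
    delta = -(y - f_out) * f_out * (1 - f_out)

    # backward pass: recurse the error signal through the layers, collect gradients
    grads_W, grads_b = [], []
    for i in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(z_outs[i], delta))   # dL/dW^(i) = z_out^(i-1) * delta^(i)
        grads_b.insert(0, delta)                        # dL/db^(i) = delta^(i)
        if i > 0:
            delta = (Ws[i] @ delta) * z_outs[i] * (1 - z_outs[i])  # error signal of layer i
    return grads_W, grads_b

# tiny example: a 2-3-1 network with random weights
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
bs = [np.zeros(3), np.zeros(1)]
gW, gb = backprop(np.array([1.0, 0.0]), np.array([1.0]), Ws, bs)
```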

Deep Learning – 8 / 9
Deep Learning

Hardware and Software

Learning goals
GPU training for accelerated
learning
Software for hardware support
Deep learning software platforms
Hardware for Deep Learning

Deep Learning – 1 / 9
HARDWARE FOR DEEP LEARNING
Deep NNs require special hardware to be trained efficiently.
The training is done using Graphics Processing Units (GPUs) and
a special programming language called CUDA.
For most NNs, training on standard CPUs takes a very long time.

Figure: Left: Each CPU can do 2-8 parallel computations. Right: A single
GPU can do thousands of simple parallel computations.

Deep Learning – 2 / 9
GRAPHICS PROCESSING UNITS (GPUS)

Initially developed to accelerate the creation of graphics


Massively parallel: identical and independent computations for
every pixel
Computer Graphics makes heavy use of linear algebra (just like
neural networks)
Less flexible than CPUs: all threads in a core concurrently execute
the same instruction on different data.
Very fast for CNNs, RNNs need more time
Popular ones: GTX 1080 Ti, RTX 3080 / 2080 Ti, Titan RTX, Tesla
V100 / A100
Hundreds of threads per core, few thousand cores, around 10
teraFLOPS in single precision, some 10s GBs of memory
Memory is important: some SOTA architectures do not fit on GPUs
with <10 GB of memory

Deep Learning – 3 / 9
TENSOR PROCESSING UNITS (TPUS)

Specialized and proprietary chip for deep learning developed by


Google
Hundreds of teraFLOPS per chip
Can be connected together in pods of thousands TPUs each
(result: hundreds of petaFLOPS per pod)
Not a consumer product – can be used in the Google Cloud
Platform (from >1.35 USD / TPU / hour) or Google Colab (free!)
Enabled impressive progress: DeepMind’s AlphaZero for Chess
became world champion after just 4h of training on 5064 TPUs

Deep Learning – 4 / 9
AND EVERYTHING ELSE...
With such powerful devices, memory/disk access during training
becomes the bottleneck
Nvidia DGX-1: Specialized solution with eight Tesla V100
GPUs, dual Intel Xeon, 512 GB of RAM, 4 SSD disks of 2TB
each
Specialized hardware for on-device inference
Example: Neural Engine on the Apple A11 (used for FaceID)
Keywords/buzzwords: Edge computing and Federated
learning

Deep Learning – 5 / 9
Software for Deep Learning

Deep Learning – 6 / 9
SOFTWARE FOR DEEP LEARNING
CUDA is a very low level programming language and thus writing
code for deep learning requires a lot of work.
Deep learning (software) frameworks:
Abstract the hardware (same code for CPU/GPU/TPU)
Automatically differentiate all computations
Distribute training among several hosts
Provide facilities for visualizing and debugging models
Can be used from several programming languages
Based on the concept of computational graph

Deep Learning – 7 / 9
SOFTWARE FOR DEEP LEARNING
Tensorflow
Popular in the industry
Developed by Google and
open source community
Python, R, C++ and Javascript APIs
Distributed training on GPUs and TPUs
Tools for visualizing neural nets, running
them efficiently on phones and embedded devices.
Keras
Intuitive, high-level wrapper
of Tensorflow for rapid prototyping
Python and (unofficial) R APIs

Deep Learning – 8 / 9
SOFTWARE FOR DEEP LEARNING
Pytorch
(Most) Popular in academia
Supported by Facebook
Python and C++ APIs
Distributed training on GPUs

MXNet
Open-source deep learning framework
written in C++ and cuda (used by
Amazon for their Amazon Web Services)
Scalable, allowing fast model training
Supports flexible model programming and
multiple languages (C++, Python, Julia,
Matlab, JavaScript, Go, R, Scala, Perl)

Deep Learning – 9 / 9
Deep Learning

Basic Regularization

Learning goals
Regularized cost functions
Norm penalties
Weight decay
Equivalence with constrained
optimization
REGULARIZATION
Any technique that is designed to reduce the test error possibly at
the expense of increased training error can be considered a form
of regularization.
Regularization is important in DL because NNs can have
extremely high capacity (millions of parameters) and are thus
prone to overfitting.

Deep Learning – 1 / 9
REVISION: REGULARIZED RISK MINIMIZATION
The goal of regularized risk minimization is to penalize the
complexity of the model to minimize the chances of overfitting.
By adding a parameter norm penalty term J (θ) to the empirical
risk Remp (θ) we obtain a regularized cost function:

Rreg (θ) = Remp (θ) + λJ (θ)

with hyperparameter λ ∈ [0, ∞) that weights the penalty term
relative to the unconstrained objective function Remp(θ).
Therefore, instead of pure empirical risk minimization, we add a
penalty for complex (read: large) parameters θ .
Declaring λ = 0 obviously results in no penalization.
We can choose between different parameter norm penalties J (θ).
In general, we do not penalize the bias.

Deep Learning – 2 / 9
L2-REGULARIZATION / WEIGHT DECAY
Let us optimize the L2-regularized risk of a model f (x | θ)

min_θ Rreg(θ) = min_θ [ Remp(θ) + (λ/2) ‖θ‖₂² ]
by gradient descent. The gradient is

∇Rreg (θ) = ∇Remp (θ) + λθ.

We iteratively update θ by step size α times the negative gradient


 
θ^[new] = θ^[old] − α (∇Remp(θ) + λθ^[old]) = θ^[old](1 − αλ) − α ∇Remp(θ)

→ The term λθ^[old] causes the parameter (weight) to decay in proportion to its size (which gives rise to the name).
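A minimal NumPy sketch of this update rule on a made-up quadratic Remp (all constants are illustration values):

```python
import numpy as np

# made-up example: Remp(theta) = 0.5 * ||A theta - y||^2 for some fixed A, y
A = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
y = np.array([1.0, 0.0, 2.0])
grad_emp = lambda theta: A.T @ (A @ theta - y)   # gradient of the unregularized risk

alpha, lam = 0.05, 0.1                           # step size and penalty weight
theta = np.zeros(2)
for _ in range(200):
    # L2-regularized GD step == weight decay: shrink theta, then take a plain GD step
    theta = theta * (1 - alpha * lam) - alpha * grad_emp(theta)
print(theta)
```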

Deep Learning – 3 / 9
EQUIVALENCE TO CONSTRAINED OPTIMIZATION
Norm penalties can be interpreted as imposing a constraint on the
weights. One can show that

arg min_θ  Remp(θ) + λ J(θ)

is equivalent to

arg min_θ  Remp(θ)
subject to J(θ) ≤ k

for some value k that depends on λ and the nature of Remp(θ).


(Goodfellow et al. (2016), ch. 7.2)

Deep Learning – 4 / 9
EXAMPLE: WEIGHT DECAY

We fit the huge neural network on the right side on a smaller fraction of
MNIST (5000 train and 1000 test observations).
Weight decay: λ ∈ (10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 0)

Deep Learning – 5 / 9
EXAMPLE: WEIGHT DECAY
Figure: Train error vs. epochs (0 to 200) for weight decay λ ∈ {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²}.

A high weight decay of 10−2 leads to a high error on the training data.

Deep Learning – 6 / 9
EXAMPLE: WEIGHT DECAY
Figure: Test error vs. epochs (0 to 200) for weight decay λ ∈ {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²}.

Second strongest weight decay leads to the best result on the test data.

Deep Learning – 7 / 9
TENSORFLOW PLAYGROUND

https://playground.tensorflow.org/

Deep Learning – 8 / 9
TENSORFLOW PLAYGROUND - EXERCISE

https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/playground-exercise-examining-l2-regularization

Deep Learning – 9 / 9
Introduction to Machine Learning

Regularization in Non-Linear Models and


Bayesian Priors

Learning goals
Understand that regularization
and parameter shrinkage can be
applied to non-linear models
Know structural risk minimization
Know how regularized risk
minimization is the same as MAP
in a Bayesian perspective, where
the penalty corresponds to a
parameter prior.
SUMMARY: REGULARIZED RISK MINIMIZATION
If we should define ML in only one line, this might be it:
min_θ Rreg(θ) = min_θ ( Σ_{i=1}^n L(y^(i), f(x^(i) | θ)) + λ · J(θ) )

We can choose for a task at hand:


the hypothesis space of f , which determines how features can
influence the predicted y
the loss function L, which measures how errors should be treated
the regularization J (θ), which encodes our inductive bias and
preference for certain simpler models

By varying these choices one can construct a huge number of different


ML models. Many ML models follow this construction principle or can
be interpreted through the lens of regularized risk minimization.

c Introduction to Machine Learning – 1 / 13


REGULARIZATION IN NONLINEAR MODELS

So far we have mainly considered regularization in LMs.


Can also be applied to non-linear models (with numeric
parameters), where it is often important to prevent overfitting.
Here, we typically use L2 regularization, which still results in
parameter shrinkage and weight decay.
By adding regularization, prediction surfaces in regression and
classification become smoother.
Note: In the chapter on non-linear SVMs we will study the effects
of regularization on a non-linear model in detail.

c Introduction to Machine Learning – 2 / 13


REGULARIZATION IN NONLINEAR MODELS
Setting: Classification for the spirals data. Neural network with single
hidden layer containing 10 neurons and logistic output activation, regularized
with L2 penalty term for λ > 0. Varying λ affects smoothness of the decision
boundary and magnitude of network weights:

c Introduction to Machine Learning – 3 / 13




REGULARIZATION IN NONLINEAR MODELS
The prevention of overfitting can also be seen in CV. Same settings as
before, but each λ is evaluated with repeated CV (10 folds, 5 reps).

We see the typical U-shape with the sweet spot between overfitting
(LHS, low λ) and underfitting (RHS, high λ) in the middle.

c Introduction to Machine Learning – 4 / 13


STRUCTURAL RISK MINIMIZATION

Thus far, we only considered adding a complexity penalty to


empirical risk minimization.
Instead, structural risk minimization (SRM) assumes that the
hypothesis space H can be decomposed into increasingly
complex hypotheses (size or capacity): H = ∪k ≥1 Hk .
Complexity parameters can be, e.g., the degree of polynomials
in linear models or the size of hidden layers in neural networks.

c Introduction to Machine Learning – 5 / 13


STRUCTURAL RISK MINIMIZATION
SRM chooses the smallest k such that the optimal model from Hk
found by ERM or RRM cannot significantly be outperformed by a
model from a Hm with m > k .
By this, the simplest model can be chosen, which minimizes the
generalization bound.
One challenge might be choosing an adequate complexity
measure, as for some models, multiple complexity measures exist.

Figure: Generalization error and training error as functions of model complexity.
c Introduction to Machine Learning – 6 / 13


STRUCTURAL RISK MINIMIZATION
Setting: Classification for the spirals data. Neural network with single
hidden layer containing k neurons and logistic output activation, L2 regularized
with λ = 0.001. So here SRM and RRM are both used. Varying the size of the
hidden layer affects smoothness of the decision boundary:

c Introduction to Machine Learning – 7 / 13




STRUCTURAL RISK MINIMIZATION
Again, complexity vs CV score.

A minimal model with good generalization seems to have ca. 6-8


hidden neurons.

c Introduction to Machine Learning – 8 / 13


STRUCTURAL RISK MINIMIZATION AND RRM
Note that normal RRM can also be interpreted through SRM, if we
rewrite the penalized ERM as constrained ERM.

min_θ Σ_{i=1}^n L(y^(i), f(x^(i) | θ))
s.t. ‖θ‖₂² ≤ t

We can interpret going through λ from large to small as going through t from
small to large. This constructs a series of ERM problems with
hypothesis spaces Hλ, where we constrain the norm of θ to balls
of growing size.

c Introduction to Machine Learning – 9 / 13


RRM VS. BAYES
We already created a link between max. likelihood estimation and ERM.

Now we will generalize this for RRM.

Assume we have a parameterized distribution p(y |θ, x) for our data and
a prior q (θ) over our parameter space, all in the Bayesian framework.

From the Bayes theorem we know:

p(θ | x, y) = p(y | θ, x) q(θ) / p(y | x) ∝ p(y | θ, x) q(θ)

c Introduction to Machine Learning – 10 / 13


RRM VS. BAYES
The maximum a posteriori (MAP) estimator of θ is now the minimizer of

− log p (y | θ, x) − log q (θ).

Again, we identify the loss L (y , f (x | θ)) with − log(p(y |θ, x)).


If q (θ) is constant (i.e., we used a uniform, non-informative prior),
the second term is irrelevant and we arrive at ERM.
If not, we can identify J (θ) ∝ − log(q (θ)), i.e., the log-prior
corresponds to the regularizer, and the additional λ, which controls
the strength of our penalty, usually influences the peakedness /
inverse variance / strength of our prior.

c Introduction to Machine Learning – 11 / 13


RRM VS. BAYES

L2 regularization corresponds to a zero-mean Gaussian prior with
constant variance on our parameters: θj ∼ N(0, τ²).
L1 corresponds to a zero-mean Laplace prior: θj ∼ Laplace(0, b).
Laplace(µ, b) has density (1/(2b)) exp(−|µ − x| / b), with scale parameter b,
mean µ and variance 2b².
In both cases, regularization strength increases as the variance of
the prior decreases: a prior probability mass more narrowly
concentrated around 0 encourages shrinkage.

c Introduction to Machine Learning – 12 / 13


EXAMPLE: BAYESIAN L2 REGULARIZATION
We can easily see the equivalence of L2 regularization and a Gaussian prior:

We define a Gaussian prior with uncorrelated components for θ :


q(θ) = N_d(0, diag(τ²)) = ∏_{j=1}^d N(0, τ²) = (2πτ²)^(−d/2) exp( −(1/(2τ²)) Σ_{j=1}^d θj² ).

With this, the MAP estimator becomes

θ̂_MAP = arg min_θ ( −log p(y | θ, x) − log q(θ) )
       = arg min_θ ( −log p(y | θ, x) + (d/2) log(2πτ²) + (1/(2τ²)) Σ_{j=1}^d θj² )
       = arg min_θ ( −log p(y | θ, x) + (1/(2τ²)) ‖θ‖₂² ).

We see how the inverse variance (precision) 1/τ² controls shrinkage.

c Introduction to Machine Learning – 13 / 13


Introduction to Machine Learning

Geometric Analysis of L2 Regularization


and Weight Decay

Learning goals
Have a geometric understanding
of L2 regularization
Understand why L2
regularization in combination
with gradient descent is called
weight decay
WEIGHT DECAY VS. L2 REGULARIZATION
Let us optimize the L2-regularized risk of a model f (x | θ)

min_θ Rreg(θ) = min_θ [ Remp(θ) + (λ/2) ∥θ∥₂² ]
by gradient descent. The gradient is

∇θ Rreg (θ) = ∇θ Remp (θ) + λθ.


We iteratively update θ by step size α times the negative gradient

 
θ^[new] = θ^[old] − α (∇θ Remp(θ^[old]) + λθ^[old])
        = θ^[old](1 − αλ) − α ∇θ Remp(θ^[old]).

The term λθ^[old] causes the parameter (weight) to decay in proportion
to its size. This is a very well-known technique in deep learning - and
simply L2 regularization in disguise.

© Introduction to Machine Learning – 1 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
When we use weight decay, we follow the steepest slope of Remp as for
gradient descent, but in every step, we are pulled back to the origin.

© Introduction to Machine Learning – 2 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
How strongly we are pulled back to the origin for a fixed stepsize α
depends only on λ (as long as the procedure converges):

© Introduction to Machine Learning – 3 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Weight decay can be interpreted geometrically.

Let’s use a quadratic Taylor approximation of the unregularized


objective Remp (θ) in the neighborhood of its minimizer θ̂ ,

R̃emp(θ) = Remp(θ̂) + ∇θ Remp(θ̂) · (θ − θ̂) + ½ (θ − θ̂)ᵀ H (θ − θ̂),

where H is the Hessian matrix of Remp (θ) evaluated at θ̂ .

The first-order term is 0 in the expression above because the


gradient is 0 at the minimizer.
H is positive semidefinite.

© Introduction to Machine Learning – 4 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
The minimum of R̃emp (θ) occurs where ∇θ R̃emp (θ) = H (θ − θ̂) is 0.

Now we L2-regularize R̃emp (θ), such that

R̃reg(θ) = R̃emp(θ) + (λ/2) ∥θ∥₂²

and solve this approximation of Rreg for the minimizer θ̂Ridge :

∇θ R̃reg (θ) = 0,
λθ + H (θ − θ̂) = 0,
(H + λI )θ = H θ̂,
θ̂Ridge = (H + λI )−1 H θ̂,

This gives us a formula to see how the minimizer of the L2-regularized
version is a transformation of the minimizer of the unpenalized version.

© Introduction to Machine Learning – 5 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
As λ approaches 0, the regularized solution θ̂Ridge approaches θ̂ .
What happens as λ grows?
Because H is a real symmetric matrix, it can be decomposed as
H = Q ΣQ ⊤ , where Σ is a diagonal matrix of eigenvalues and Q
is an orthonormal basis of eigenvectors.
Rewriting the transformation formula with this:

θ̂_Ridge = (Q Σ Q⊤ + λI)⁻¹ Q Σ Q⊤ θ̂
        = [ Q (Σ + λI) Q⊤ ]⁻¹ Q Σ Q⊤ θ̂
        = Q (Σ + λI)⁻¹ Σ Q⊤ θ̂

Therefore, weight decay rescales θ̂ along the axes defined by the
eigenvectors of H. The component of θ̂ that is aligned with the j-th
eigenvector of H is rescaled by a factor of σj / (σj + λ), where σj is the
corresponding eigenvalue.
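A quick numerical check of this rescaling identity on a made-up positive definite H (all values are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
H = B @ B.T + np.eye(3)          # a positive definite "Hessian"
theta_hat = rng.normal(size=3)   # unregularized minimizer
lam = 0.5

# direct formula: (H + lambda I)^(-1) H theta_hat
direct = np.linalg.solve(H + lam * np.eye(3), H @ theta_hat)

# eigendecomposition view: rotate, rescale each component by sigma_j / (sigma_j + lambda), rotate back
sigma, Q = np.linalg.eigh(H)
rescaled = Q @ ((sigma / (sigma + lam)) * (Q.T @ theta_hat))

print(np.allclose(direct, rescaled))   # True
```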
© Introduction to Machine Learning – 6 / 12
WEIGHT DECAY VS. L2 REGULARIZATION
Firstly, θ̂ is rotated by Q ⊤ , which we can interpret as a projection of θ̂
on the rotated coordinate system defined by the principal directions of
H:

© Introduction to Machine Learning – 7 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Since, for λ = 0, the transformation matrix (Σ + λI )−1 Σ = Σ−1 Σ = I,
we simply arrive at θ̂ again after projecting back.

© Introduction to Machine Learning – 8 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
If λ > 0, the component projected on the j-th axis gets rescaled by
σj / (σj + λ) before θ̂_Ridge is rotated back.

© Introduction to Machine Learning – 9 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Along directions where the eigenvalues of H are relatively large, for
example, where σj >> λ, the effect of regularization is quite small.
On the other hand, components with σj << λ will be shrunk to
have nearly zero magnitude.
In other words, only directions along which the parameters
contribute significantly to reducing the objective function are
preserved relatively intact.
In the other directions, a small eigenvalue of the Hessian means
that moving in this direction will not significantly increase the
gradient. For such unimportant directions, the corresponding
components of θ are decayed away.

© Introduction to Machine Learning – 10 / 12


WEIGHT DECAY VS. L2 REGULARIZATION

Credit: Goodfellow et al. (2016), ch. 7

Figure: The solid ellipses represent the contours of the unregularized objective and
the dashed circles represent the contours of the L2 penalty. At θ̂Ridge , the competing
objectives reach an equilibrium.

In the first dimension, the eigenvalue of the Hessian of Remp (θ) is small. The
objective function does not increase much when moving horizontally away
from θ̂ . Therefore, the regularizer has a strong effect on this axis and θ1 is
pulled close to zero.

© Introduction to Machine Learning – 11 / 12


WEIGHT DECAY VS. L2 REGULARIZATION

Credit: Goodfellow et al. (2016), ch. 7

Figure: The solid ellipses represent the contours of the unregularized objective and
the dashed circles represent the contours of the L2 penalty. At θ̂Ridge , the competing
objectives reach an equilibrium.

In the second dimension, the corresponding eigenvalue is large indicating high


curvature. The objective function is very sensitive to movement along this axis
and, as a result, the position of θ2 is less affected by the regularization.

© Introduction to Machine Learning – 12 / 12


Introduction to Machine Learning

Early Stopping

Learning goals
Know how early stopping works
Understand how early stopping
acts as a regularizer
EARLY STOPPING
When training with an iterative optimizer such as SGD, it is
commonly the case that, after a certain number of iterations,
generalization error begins to increase even though training error
continues to decrease.
Early stopping refers to stopping the algorithm early before the
generalization error increases.

Figure: After a certain number of iterations, the algorithm begins to overfit.

© Introduction to Machine Learning – 1 / 4


EARLY STOPPING
How early stopping works:
1 Split training data Dtrain into Dsubtrain and Dval (e.g. with a ratio of
2:1).
2 Train on Dsubtrain and evaluate model using the validation set Dval .
3 Stop training when validation error stops decreasing (after a range
of “patience” steps).
4 Use parameters of the previous step for the actual model.
More sophisticated forms also apply cross-validation.
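A minimal sketch of such a training loop with a patience counter; train_epoch and validation_error are placeholders for whatever training and evaluation routines are used (they are not part of the lecture material):

```python
import copy

def fit_with_early_stopping(model, train_epoch, validation_error, patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                    # one pass over D_subtrain
        err = validation_error(model)         # evaluate on D_val
        if err < best_err:
            best_err, best_state = err, copy.deepcopy(model)  # temporary copy of the parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_state, best_err
```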

© Introduction to Machine Learning – 2 / 4


EARLY STOPPING
Strengths                              | Weaknesses
Effective and simple                   | Periodical evaluation of validation error
Applicable to almost any model         | Temporary copy of θ (we have to save the whole
  without adjustment                   |   model each time validation error improves)
Combinable with other regularization   | Less data for training → include Dval afterwards
  methods                              |

Relation between optimal early-stopping iteration Tstop and
weight-decay penalization parameter λ for step-size α (see
Goodfellow et al. (2016), pages 251-252, for a proof):

Tstop ≈ 1 / (αλ)   ⇔   λ ≈ 1 / (Tstop · α)

Small λ (low penalization) ⇒ high Tstop (complex model / lots of updates).

© Introduction to Machine Learning – 3 / 4


EARLY STOPPING

Credit: Goodfellow et al. (2016)

Figure: An illustration of the effect of early stopping. Left: The solid contour lines
indicate the contours of the negative log-likelihood. The dashed line indicates the
trajectory taken by SGD beginning from the origin. Rather than stopping at the point θ̂
that minimizes the risk, early stopping results in the trajectory stopping at an earlier
point θ̂Ridge . Right: An illustration of the effect of L2 regularization for comparison. The
dashed circles indicate the contours of the L2 penalty which causes the minimum of the
total cost to lie closer to the origin than the minimum of the unregularized cost.

© Introduction to Machine Learning – 4 / 4


Deep Learning

Dropout and Augmentation

Learning goals
Recap: Ensemble Methods
Dropout
Augmentation
Ensemble Methods

Deep Learning – 1 / 22
RECAP: ENSEMBLE METHODS
Idea: Train several models separately, and average their
prediction (i.e. perform model averaging).
Intuition: This improves performance on test set, since different
models will not make the same errors.
Ensembles can be constructed in different ways, e.g.:
by combining completely different kind of models (using
different learning algorithms and loss functions).
by bagging: train the same model on k datasets, constructed
by sampling n samples from original dataset.
Since training a neural network repeatedly on the same dataset
results in different solutions (why?) it can even make sense to
combine those.

Deep Learning – 2 / 22
RECAP: ENSEMBLE METHODS

Figure: A cartoon description of bagging (Goodfellow et al. (2016))

Deep Learning – 3 / 22
Dropout

Deep Learning – 4 / 22
DROPOUT
Idea: reduce overfitting in neural networks by preventing complex
co-adaptations of neurons.
Method: during training, random subsets of the neurons are
removed from the network (they are "dropped out"). This is done by
artificially setting the activations of those neurons to zero.
Whether a given unit/neuron is dropped out or not is completely
independent of the other units.
If the network has N (input/hidden) units, applying dropout to these
units can result in 2N possible ’subnetworks’.
Because these subnetworks are derived from the same ’parent’
network, many of the weights are shared.
Dropout can be seen as a form of "model averaging".

Deep Learning – 5 / 22
DROPOUT

Deep Learning – 6 / 22
DROPOUT

In each iteration, for each training example (in the forward pass), a
different (random) subset of neurons is dropped out.

Deep Learning – 7 / 22
DROPOUT: ALGORITHM
To train with dropout a minibatch-based learning algorithm such as
stochastic gradient descent is used.
For each training case in a minibatch, we randomly sample a
binary vector/mask µ with one entry for each input or hidden unit in
the network. The entries of µ are sampled independently from
each other.
The probability of sampling a mask value of 0 (dropout) for one
unit is a hyperparameter known as the ’dropout rate’.
A typical value for the dropout rate is 0.2 for input units and 0.5 for
hidden units.
Each unit in the network is multiplied by the corresponding mask
value resulting in a subnetµ .
Forward propagation, backpropagation, and the learning update
are run as usual.

Deep Learning – 10 / 22
DROPOUT: ALGORITHM
Algorithm 1 Training a (parent) neural network with dropout rate p
1: Define parent network and initialize weights
2: for each minibatch: do
3: for each training sample: do
4: Draw mask µ using p
5: Compute forward pass for subnetµ
6: end for
7: Update the weights of the (parent) network by performing a gradient descent step
with weight decay
8: end for

The derivatives wrt. each parameter are averaged over the training
cases in each mini-batch. Any training case which does not use a
parameter contributes a gradient of zero for that parameter.

Deep Learning – 11 / 22
DROPOUT: WEIGHT SCALING
The weights of the network will be larger than normal because of
dropout. Therefore, to obtain a prediction at test time the weights
must be first scaled by the chosen dropout rate.
This means that if a unit (neuron) is retained with probability p
during training, the weight of that unit is multiplied by p at test time.

Credit: Srivastava et. al. (2014)

Weight scaling ensures that the expected total input to a
neuron/unit at test time is roughly the same as the expected total
input to that unit at train time, even though many of the units at
train time were missing on average.

Deep Learning – 12 / 22
DROPOUT: WEIGHT SCALING
Rescaling of the weights can also be performed at training time
instead, after each weight update at the end of the mini-batch.
This is sometimes called ’inverse dropout’. Keras and PyTorch
deep learning libraries implement dropout in this way.
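A minimal NumPy sketch of inverted dropout for one layer (the rescaling by 1/(1 − dropout rate) happens at training time, so nothing needs to be rescaled at test time; the function and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, dropout_rate=0.5, training=True):
    """Inverted dropout: zero out units at random and rescale the survivors during training."""
    if not training or dropout_rate == 0.0:
        return activations                      # at test time the layer is left untouched
    keep_prob = 1.0 - dropout_rate
    mask = rng.binomial(1, keep_prob, size=activations.shape)   # mu: 1 = keep, 0 = drop
    return activations * mask / keep_prob       # rescale so the expected total input is unchanged

z = np.array([0.2, -1.3, 0.7, 2.1])
print(dropout_forward(z, dropout_rate=0.5))
```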

Deep Learning – 13 / 22
DROPOUT: EXAMPLE

To demonstrate how dropout can easily improve generalization, we
train neural networks with the structure shown on the right.
Each neural network we fit has different dropout probabilities:
a tuple where one probability is for the input layer and one is
for the hidden layers. We consider the tuples
(0; 0), (0.2; 0.2) and (0.6; 0.5).

Deep Learning – 14 / 22
DROPOUT: EXAMPLE
Figure: Test error vs. epochs (150 to 200) for dropout rates (input; hidden layers) of (0;0), (0.2;0.2) and (0.6;0.5).

Dropout rate of 0 (no dropouts) leads to higher test error than dropping
some units out.

Deep Learning – 15 / 22
DROPOUT, WEIGHT DECAY OR BOTH?
Figure: Test error vs. epochs (0 to 500) comparing four settings: unregularized, dropout, weight decay, and dropout + weight decay.

Here, dropout leads to a smaller test error than using no regularization or solely weight decay.

Deep Learning – 16 / 22
Dataset Augmentation

Deep Learning – 17 / 22
DATASET AUGMENTATION
Problem: low generalization because of a high ratio of
(complexity of the model) / (#train data)
Idea: artificially increase the train data.
Limited data supply → create “fake data”!
Increase variation in inputs without changing the labels.
Application:
Image and Object recognition (rotation, scaling, pixel
translation, flipping, noise injection, vignetting, color casting,
lens distortion, injection of random negatives)
Speech recognition (speed augmentation, vocal tract
perturbation)
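A minimal NumPy/SciPy sketch of a few such label-preserving perturbations (flip, small rotation, noise injection); which transformations are actually safe depends on the task, as the digit example on the following slides shows:

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly perturbed copy of a 2D grayscale image (label unchanged)."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                               # horizontal flip
    angle = rng.uniform(-10, 10)                           # small random rotation in degrees
    out = rotate(out, angle, reshape=False, mode="nearest")
    out = out + rng.normal(0.0, 0.01, size=out.shape)      # mild noise injection
    return np.clip(out, 0.0, 1.0)

image = rng.random((28, 28))                     # stand-in for a real training image
augmented = [augment(image) for _ in range(5)]   # five "fake" variants of the same example
```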

Deep Learning – 18 / 22
DATASET AUGMENTATION

Figure: (Wu et al. (2015))

Deep Learning – 19 / 22
DATASET AUGMENTATION

Figure: (Wu et al. (2015))

⇒ careful when rotating digits (6 will become 9 and vice versa)!

Deep Learning – 20 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2012)
Improving neural networks by preventing co-adaptation of feature detectors
http://arxiv.org/abs/1207.0580
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2014)
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
http://jmlr.org/papers/v15/srivastava14a.html
Wu Ren, Yan Shengen, Shan Yi, Dang Qingqing and Sun Gang (2015)
Deep Image: Scaling up Image Recognition
https://arxiv.org/abs/1501.02876

Deep Learning – 21 / 22
Deep Learning

Challenges in Optimization

Learning goals
Ill-Conditioning
Local Minima
Saddle Points
Cliffs and Exploding Gradients
CHALLENGES IN OPTIMIZATION
In this section, we summarize several of the most prominent
challenges regarding training of deep neural networks.
Traditionally, machine learning ensures that the optimization
problem is convex by carefully designing the objective function and
constraints. But for neural networks we are confronted with the
general nonconvex case.
Furthermore, we will see in this section that even convex
optimization is not without its complications.

Deep Learning – 1 / 34
Ill-Conditioning

Deep Learning – 2 / 34
EFFECTS OF CURVATURE
Intuitively, the curvature of a function determines the outcome of a GD
step. . .

Source: Goodfellow et al., (2016), ch. 4

Figure: Quadratic objective function f (x) with various curvatures. The dashed line
indicates the first order taylor approximation based on the gradient information alone.
Left: With negative curvature, the cost function decreases faster than the gradient
predicts; Middle: With no curvature, the gradient predicts the decrease correctly; Right:
With positive curvature, the function decreases more slowly than expected and begins
to increase.

Deep Learning – 3 / 34
SECOND DERIVATIVE AND CURVATURE
To understand better how the curvature of a function influences the
outcome of a gradient descent step, let us recall how curvature is
described mathematically:
The second derivative corresponds to the curvature of the graph of
a function.
The Hessian matrix of a function R(θ) : Rm → R is the matrix of
second-order partial derivatives

H_ij = ∂²R(θ) / (∂θ_i ∂θ_j).

Deep Learning – 4 / 34
SECOND DERIVATIVE AND CURVATURE
The second derivative in a direction d, with ∥d∥ = 1, is given by
d⊤H d.
What is the direction of the highest curvature (red direction), and
what is the direction of the lowest curvature (blue)?

Deep Learning – 5 / 34
SECOND DERIVATIVE AND CURVATURE
Since H is real and symmetric (why?), eigendecomposition yields
H = Vdiag(λ)V−1 with V and λ collecting eigenvectors and
eigenvalues, respectively.
It can be shown that the eigenvector v_max with the max.
eigenvalue λ_max points into the direction of highest curvature
(v_maxᵀ H v_max = λ_max), while the eigenvector v_min with the min.
eigenvalue λ_min points into the direction of least curvature.
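A small NumPy sketch of the directional curvature d⊤Hd for a made-up symmetric matrix H, confirming that the eigenvectors of H give the directions of extreme curvature:

```python
import numpy as np

H = np.array([[5.0, 1.0],
              [1.0, 0.5]])                  # made-up symmetric "Hessian"

eigvals, eigvecs = np.linalg.eigh(H)        # ascending eigenvalues, orthonormal eigenvectors
v_min, v_max = eigvecs[:, 0], eigvecs[:, -1]

curvature = lambda d: d @ H @ d / (d @ d)   # second derivative in direction d
print(curvature(v_max), eigvals[-1])        # highest curvature equals lambda_max
print(curvature(v_min), eigvals[0])         # lowest curvature equals lambda_min
```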

Deep Learning – 6 / 34
SECOND DERIVATIVE AND CURVATURE
At a stationary point θ , where the gradient is 0, we can examine
the eigenvalues of the Hessian to determine whether the θ is a
local maximum, minimum or saddle point:
∀i : λi > 0 (H positive definite at θ) ⇒ minimum at θ
∀i : λi < 0 (H negative definite at θ) ⇒ maximum at θ
∃i : λi < 0 ∧ ∃j : λj > 0 (H indefinite at θ) ⇒ saddle point at θ

Credit: Rong Ge (2016)

Deep Learning – 7 / 34
ILL-CONDITIONED HESSIAN MATRIX
The condition number of a symmetric matrix A is given by the ratio of its
max/min eigenvalues, κ(A) = |λ_max| / |λ_min|. A matrix is called ill-conditioned if
the condition number κ(A) is very high.

An ill-conditioned Hessian matrix means that the ratio of max. / min.
curvature is high, as in the example below:

Deep Learning – 8 / 34
CURVATURE AND STEP-SIZE IN GD
What does it mean for gradient descent if the Hessian is ill-conditioned?
Let us consider the second-order Taylor approximation as a local
approximation of R around a current point θ⁰ (with gradient g):

R(θ) ≈ T₂f(θ, θ⁰) := R(θ⁰) + (θ − θ⁰)⊤ g + ½ (θ − θ⁰)⊤ H (θ − θ⁰)
Furthermore, Taylor’s theorem states (proof in Koenigsberger
(1997), p. 68)
lim_{θ → θ⁰}  (R(θ) − T₂f(θ, θ⁰)) / ||θ − θ⁰||² = 0

Deep Learning – 9 / 34
CURVATURE AND STEP-SIZE IN GD
One GD step with a learning rate α yields new parameters
θ⁰ − αg and a new approximated loss value

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg.

Theoretically, if g⊤Hg is positive, we can solve the equation above
for the optimal step size, which corresponds to

α* = g⊤g / (g⊤Hg).

Deep Learning – 10 / 34
CURVATURE AND STEP-SIZE IN GD
If the gradient g points into the direction of v_max (i.e.
the direction of highest curvature), the optimal step size is given by

α* = g⊤g / (g⊤Hg) = g⊤g / (λ_max g⊤g) = 1 / λ_max,
which is very small. Choosing a too large step-size is bad, as it
will make us “overshoot” the stationary point.
If, on the other hand, g points into the direction of the lowest
curvature, the optimal step size is

α* = 1 / λ_min,
which corresponds to the largest possible optimal step-size.
We summarize: We want to perform big steps in directions of low
curvature, but small steps in directions of high curvature.

Deep Learning – 11 / 34
CURVATURE AND STEP-SIZE IN GD
But what if the gradient does not point into the direction of one of
the eigenvectors?
Let us consider the 2-dimensional case: We can decompose the
direction of g (black) into the two eigenvectors vmax and vmin
It would be optimal to perform a big step into the direction of the
smallest curvature vmin , but a small step into the direction of vmax ,
but the gradient points into a completely different direction.

Deep Learning – 12 / 34
ILL-CONDITIONING
GD is unaware of large differences in curvature, and can only walk
into the direction of the gradient.
Choosing a too large step-size will then cause the descent
direction to change frequently (“jumping around”).
α needs to be small enough, which results in a low progress.

Deep Learning – 13 / 34
ILL-CONDITIONING
This effect is more severe, if a Hessian has a poor condition
number, i.e. the ratio between lowest and highest curvature is
large; gradient descent will perform poorly.

Figure: The contour lines show a quadratic risk function with a poorly conditioned
Hessian matrix. The plot shows the progress of gradient descent with a small step-size
vs. larger step-size. In both cases, convergence to the global optimum is rather slow.

Deep Learning – 14 / 34
ILL-CONDITIONING
In the worst case, ill-conditioning of the Hessian matrix and a too
big step-size will cause the risk to increase

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg,

which happens if

½ α² g⊤Hg > α g⊤g.
To determine whether ill-conditioning is detrimental to the training,
the squared gradient norm g⊤ g and the risk can be monitored.

Deep Learning – 15 / 34
ILL-CONDITIONING

Source: Goodfellow, ch. 6

Gradient norms increase over time, showing that the training
process is not converging to a stationary point g = 0.
At the same time, we observe that the risk stays approximately constant
while the gradient norm increases:

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg

Here the left-hand side is approx. constant and g⊤g increases, so the
curvature term ½ α² g⊤Hg must increase as well.

Deep Learning – 16 / 34
Local Minima

Deep Learning – 17 / 34
UNIMODAL VS. MULTIMODAL LOSS SURFACES

Figure: Left: Multimodal loss surface with saddle points; Right: (Nearly)
unimodal loss surface (Hao Li et al. (2017))

Deep Learning – 18 / 34
MULTIMODAL FUNCTION
Potential snippet from a loss surface of a deep neural network with
many local minima:

Deep Learning – 19 / 34
ONLY LOCALLY OPTIMAL MOVES
If the training algorithm makes only locally optimal moves (as in gradient
descent), it may move away from regions of much lower cost.

Source: Goodfellow, Ch. 8

In the figure above, initializing the parameter on the "wrong" side of the
hill will result in suboptimal performance.
In higher dimensions, however, it may be possible for gradient descent to
go around the hill but such a trajectory might be very long and result in
excessive training time.

Deep Learning – 20 / 34
LOCAL MINIMA
Weight space symmetry:
If we swap the incoming weight vectors of neurons i and j and do the
same for the outgoing weights, the modelled function stays
unchanged.
⇒ with n hidden units and one hidden layer there are n!
networks with the same empirical risk
If we multiply the incoming weights of a ReLU neuron by β and
the outgoing ones by 1/β, the modelled function stays unchanged.
⇒ The empirical risk of a NN can have very many minima with
equivalent empirical risk.

Deep Learning – 21 / 34
LOCAL MINIMA
In practice only local minima with a high value compared to the
global minimium are problematic.

Source: Goodfellow, Ch. 4

Current literature suspects that most local minima have low


empirical risk.
Simple test: Norm of gradient should get close to zero.

Deep Learning – 22 / 34
Saddle Points

Deep Learning – 23 / 34
SADDLE POINTS
In optimization we look for areas with zero gradient.
A variant of zero gradient areas are saddle points.
For the empirical risk R : ℝ^m → ℝ of a neural network, the expected ratio of
the number of saddle points to local minima typically grows
exponentially with m.
In other words: Networks with more parameters (deeper networks
or larger layers) exhibit a lot more saddle points than local minima.
Why is that?
The Hessian at a local minimum has only positive eigenvalues. At
a saddle point it is a mixture of positive and negative eigenvalues.

Deep Learning – 24 / 34
SADDLE POINTS
Imagine the sign of each eigenvalue is generated by coin flipping:
In a single dimension, it is easy to obtain a local minimum
(e.g. “head” means positive eigenvalue).
In an m-dimensional space, it is exponentially unlikely that all
m coin tosses will be head.
A property of many random functions is that eigenvalues of the
Hessian become more likely to be positive in regions of lower cost.
For the coin flipping example, this means we are more likely to
have heads m times if we are at a critical point with low cost.
That means in particular that local minima are much more likely to
have low cost than high cost and critical points with high cost are
far more likely to be saddle points.
See Dauphin et al. (2014) for a more detailed investigation.

Deep Learning – 25 / 34
SADDLE POINTS
“Saddle points are surrounded by high error plateaus that can
dramatically slow down learning, and give the illusory impression
of the existence of a local minimum” (Dauphin et al. (2014)).

Deep Learning – 26 / 34
SADDLE POINTS: EXAMPLE

f(x1, x2) = x1² − x2²

Along x1, the function curves upwards (eigenvector of the Hessian
with positive eigenvalue). Along x2, the function curves downwards
(eigenvector of the Hessian with negative eigenvalue).
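A tiny gradient descent sketch on this function: started exactly on the x1 axis, plain GD converges to the saddle point at the origin and stays there (any nonzero x2 component would eventually let it escape):

```python
import numpy as np

grad = lambda x: np.array([2 * x[0], -2 * x[1]])   # gradient of f(x1, x2) = x1^2 - x2^2

x = np.array([1.0, 0.0])      # start on the x1 axis (x2 component exactly zero)
for _ in range(100):
    x = x - 0.1 * grad(x)     # plain gradient descent
print(x)                      # approx. [0, 0]: stuck at the saddle point
```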

Deep Learning – 27 / 34
SADDLE POINTS
So how do saddle points impair optimization?
First-order algorithms that use only gradient information might get
stuck in saddle points.
Second-order algorithms experience even greater problems when
dealing with saddle points. Newtons method for example actively
searches for a region with zero gradient. That might be another
reason why second-order methods have not succeeded in
replacing gradient descent for neural network training.

Deep Learning – 28 / 34
EXAMPLE: SADDLE POINT WITH GD

Red dot: Starting location

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

First step...

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

...second step...

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

...tenth step got stuck and cannot escape the saddle point!

Deep Learning – 29 / 34
Cliffs and Exploding Gradients

Deep Learning – 30 / 34
CLIFFS AND EXPLODING GRADIENTS
As a result of the multiplication of several parameters, the
empirical risk for highly nonlinear deep neural networks often
contains sharp nonlinearities.
That may result in very high derivatives in some places.
As the parameters get close to such cliff regions, a gradient
descent update can catapult the parameters very far.
Such an occurrence can lead to losing most of the
optimization work that had been done.
However, serious consequences can be easily avoided using a
technique called gradient clipping.
The gradient does not specify the optimal step size, but only the
optimal direction within an infinitesimal region.

Deep Learning – 31 / 34
CLIFFS AND EXPLODING GRADIENTS
Gradient clipping simply caps the step size to be small enough that
it is less likely to go outside the region where the gradient indicates
the direction of steepest descent.
We simply “prune” the norm of the gradient at some threshold h:

if ||∇θ|| > h :   ∇θ ← (h / ||∇θ||) · ∇θ
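A one-function NumPy sketch of this clipping rule:

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient so that its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])           # a "cliff" gradient with norm 50
print(clip_gradient(g, threshold=5))  # [ 3. -4.], norm capped at 5
```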

Deep Learning – 32 / 34
EXAMPLE: CLIFFS AND EXPLODING GRADIENTS

Figure: “The objective function for highly nonlinear deep neural networks or
for recurrent neural networks often contains sharp nonlinearities in parameter
space resulting from the multiplication of several parameters. These
nonlinearities give rise to very high derivatives in some places. When the
parameters get close to such a cliff region, a gradient descent update can
catapult the parameters very far, possibly losing most of the optimization work
that had been done” (Goodfellow et al. (2016)).

Deep Learning – 33 / 34
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya
Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Konrad Koenigsberger (1997)
Analysis 2, Springer

Rong Ge (2016)
Escaping from Saddle Points
http://www.offconvex.org/2016/03/22/saddlepoints/

Deep Learning – 34 / 34
Deep Learning

Advanced Optimization

Learning goals
SGD with Momentum
Learning Rate Schedules
Adaptive Learning Rates
Batch Normalization
Momentum

Deep Learning – 1 / 47
MOMENTUM
While SGD remains a popular optimization strategy, learning with it
can sometimes be slow.
Momentum is designed to accelerate learning, especially when
facing high curvature, small but consistent or noisy gradients.
Momentum accumulates an exponentially decaying moving
average of past gradients:
ν ← φν − α [ (1/m) Σ_i ∇θ L(y^(i), f(x^(i), θ)) ]
θ ← θ + ν

We introduce a new hyperparameter φ ∈ [0, 1), determining how
quickly the contributions of previous gradients decay.
ν is called “velocity” and derives from a physical analogy
describing how particles move through a parameter space
(Newton’s law of motion).
Deep Learning – 2 / 47
MOMENTUM
So far the step size was simply the gradient g multiplied by the
learning rate α.
Now, the step size depends on how large and how aligned a
sequence of gradients is. The step size grows when many
successive gradients point in the same direction.
Common values for φ are 0.5, 0.9 and even 0.99.
Generally, the larger φ is relative to the learning rate α, the more
previous gradients affect the current direction.
A very good website with an in-depth analysis of momentum:
https://distill.pub/2017/momentum/

Deep Learning – 3 / 47
MOMENTUM: EXAMPLE

ν1 ← φν0 − αg (θ[0] )
θ[1] ← θ[0] + φν0 − αg (θ[0] )
ν2 ← φν1 − αg (θ[1] )
= φ(φν0 − αg (θ[0] )) − αg (θ[1] )
θ[2] ← θ[1] + φ(φν0 − αg (θ[0] )) − αg (θ[1] )
ν3 ← φν2 − αg (θ[2] )
= φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
θ[3] ← θ[2] + φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= θ[2] + φ3 ν0 − φ2 αg (θ[0] ) − φαg (θ[1] ) − αg (θ[2] )
= θ[2] − α(φ2 g (θ[0] ) + φ1 g (θ[1] ) + φ0 g (θ[2] )) + φ3 ν0
θ[t+1] = θ[t] − α Σ_{j=0}^t φ^j g(θ[t−j]) + φ^(t+1) ν0

Deep Learning – 4 / 47
MOMENTUM: EXAMPLE
Suppose momentum always observes the same gradient g (θ):
θ[t+1] = θ[t] − α Σ_{j=0}^t φ^j g(θ[t−j]) + φ^(t+1) ν0
       = θ[t] − α g(θ) Σ_{j=0}^t φ^j + φ^(t+1) ν0
       = θ[t] − α g(θ) (1 − φ^(t+1)) / (1 − φ) + φ^(t+1) ν0
       → θ[t] − α g(θ) / (1 − φ)   for t → ∞.

Thus, momentum will accelerate in the direction of −g (θ) until reaching terminal
velocity with step size:

−α g(θ)(1 + φ + φ² + φ³ + ...) = −α g(θ) / (1 − φ)

E.g. a momentum with φ = 0.9 corresponds to multiplying the maximum speed by 10


relative to the gradient descent algorithm.

Deep Learning – 5 / 47
MOMENTUM: ILLUSTRATION
The vector ν3 (for ν0 = 0):
ν3 = φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= −φ2 (αg (θ[0] )) − φ(αg (θ[1] )) − αg (θ[2] )

Figure: If consecutive (negative) gradients point mostly in the same direction, the
velocity "builds up". On the other hand, if consecutive (negative) gradients point in very
different directions, the velocity "dies down".

Deep Learning – 6 / 47
SGD WITH MOMENTUM

Algorithm Stochastic gradient descent with momentum


1: require learning rate α and momentum φ
2: require initial parameter θ and initial velocity ν
3: while stopping criterion not met do
4: Sample a minibatch of m examples from the training set {x̃^(1), . . . , x̃^(m)}
5: Compute gradient estimate: ĝ ← (1/m) Σ_i ∇θ L(y^(i), f(x̃^(i) | θ))
6: Compute velocity update: ν ← φν − αĝ
7: Apply update: θ ← θ + ν
8: end while
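A minimal NumPy sketch of this algorithm; grad_fn stands in for the minibatch gradient estimate, and the quadratic toy objective is made up:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, alpha=0.01, phi=0.9, n_steps=100):
    """grad_fn(theta) should return the (minibatch) gradient estimate g_hat."""
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        g_hat = grad_fn(theta)                        # gradient estimate
        velocity = phi * velocity - alpha * g_hat     # decaying average of past gradients
        theta = theta + velocity                      # apply update
    return theta

# toy example: minimize a poorly conditioned quadratic 0.5 * theta' H theta
H = np.diag([100.0, 1.0])
print(sgd_momentum(lambda th: H @ th, theta=np.array([1.0, 1.0])))
```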

Deep Learning – 7 / 47
SGD WITH MOMENTUM

Figure: The contour lines show a quadratic loss function with a poorly conditioned
Hessian matrix. The two curves show how standard gradient descent (black) and
momentum (red) learn when dealing with ravines. Momentum reduces the oscillation
and accelerates the convergence.

Deep Learning – 8 / 47
SGD WITH AND WITHOUT MOMENTUM
The following plot was created by our Shiny App. On the upper left you can explore
different predefined examples. Click here

Figure: Comparison of SGD with and without momentum on the Styblinkski-Tang


function. The black dot on the bottom left is the global optimum. We can see that SGD
without momentum (red line/points) cannot escape the local minimum, while SGD with
momentum (blue line/dots) is able to escape the local minimum and finds the global
minimum.

Deep Learning – 9 / 47
MOMENTUM IN PRACTICE

Let's try out different values of momentum (with SGD) on the
MNIST data.
We apply the same architecture we have used a dozen times
already (note that we used φ = 0.9 in all computations so far,
i.e. in chapters 1 and 2)!

Deep Learning – 10 / 47
MOMENTUM IN PRACTICE

The higher the momentum, the faster SGD learns the weights on the training data,
but if the momentum is too large, the training and test error fluctuate.

Deep Learning – 11 / 47
NESTEROV MOMENTUM
Momentum aims to solve poor conditioning of the Hessian but also
variance in the stochastic gradient.
Nesterov momentum modifies the algorithm such that the gradient
is evaluated after the current velocity is applied:
ν ← φν − α [ (1/m) Σ_i ∇θ L(y^(i), f(x^(i), θ + φν)) ]
θ ← θ + ν

We can interpret Nesterov momentum as an attempt to add a


correction factor to the basic method.
The method is also called Nesterov accelerated gradient (NAG).

Deep Learning – 13 / 47
SGD WITH NESTEROV MOMENTUM

Algorithm Stochastic gradient descent with Nesterov momentum


1: require learning rate α and momentum φ
2: require initial parameter θ and initial velocity ν
3: while stopping criterion not met do
4: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
5: Apply interim update: θ̃ ← θ + φν
6: Compute gradient estimate: ĝ ← (1/m) Σ_i ∇θ̃ L(y^(i), f(x^(i), θ̃))
7: Compute velocity update: ν ← φν − αĝ
8: Apply update: θ ← θ + ν
9: end while

Deep Learning – 14 / 47
MOMENTUM VS. NESTEROV MOMENTUM

Credits: Chandra (2015)

Figure: Comparison GD with momentum (left) and GD with Nesterov momentum


(right) for one parameter θ . The first three updates of θ are very similar in both cases
and the updates become larger due to momentum (accumulation of previous negative
gradients). Update 4 is different. In case of momentum, the update overshoots as it
makes an even bigger step due to the gradient history. In contrast, Nesterov
momentum first evaluates a "look-ahead" point θlook_ahead , detects that it overshoots,
and slightly reduces the overall magnitude of the fourth update. Thus, Nesterov
momentum reduces overshooting and leads to smaller oscillations than momentum.

Deep Learning – 15 / 47
Learning Rates

Deep Learning – 16 / 47
LEARNING RATE
The learning rate is a very important hyperparameter.
To systematically find a good learning rate, we can start at a very
low learning rate and gradually increase it (linearly or
exponentially) after each mini-batch.
We can then plot the learning rate and the training loss for each
batch.
A good learning rate is one that results in a steep decline in the
loss.

Credit: jeremyjordan

Deep Learning – 17 / 47
LEARNING RATE SCHEDULE
We would like to force convergence until reaching a local minimum.
Applying SGD, we have to decrease the learning rate over time,
thus α[t ] (learning rate at training iteration t).
The estimator ĝ is computed based on small batches.
Randomly sampling m training samples introduces noise that
does not vanish even if we find a minimum.
In practice, a common strategy is to decay the learning rate
linearly over time until iteration τ :
α^[t] = (1 − t/τ) α^[0] + (t/τ) α^[τ]   for t ≤ τ
α^[t] = α^[τ]                           for t > τ

Deep Learning – 18 / 47
LEARNING RATE SCHEDULE
Example for τ = 4:

iteration t    t/τ     α[t]
1              0.25    (1 − 1/4) α[0] + (1/4) α[τ] = (3/4) α[0] + (1/4) α[τ]
2              0.5     (2/4) α[0] + (2/4) α[τ]
3              0.75    (1/4) α[0] + (3/4) α[τ]
4              1       0 + α[τ] = α[τ]
...                    α[τ]
t + 1                  α[τ]
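A small sketch of this schedule; plugging in τ = 4 reproduces the rows of the table above:

def lr_linear_decay(t, tau, alpha_0, alpha_tau):
    """Linearly decay the learning rate from alpha_0 to alpha_tau until iteration tau."""
    if t > tau:
        return alpha_tau
    return (1 - t / tau) * alpha_0 + (t / tau) * alpha_tau

# e.g. tau = 4: t = 1 gives 0.75*alpha_0 + 0.25*alpha_tau, t >= 4 gives alpha_tau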

Deep Learning – 19 / 47
CYCLICAL LEARNING RATES
Another option is to have a learning rate that periodically varies
according to some cyclic function.
The idea: if training no longer improves the loss (possibly because it is stuck near a saddle point), increasing the learning rate makes it possible to rapidly traverse such regions.
Recall, saddle points are far more likely than local minima in deep
nets.
Each cycle has a fixed length in terms of the number of iterations.

Deep Learning – 20 / 47
CYCLICAL LEARNING RATES
One such cyclical function is the "triangular" function.

Credit: Hafidz Zulkifli

In the right image, the range is cut in half after each cycle.
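A sketch of the triangular schedule, with step_size (iterations per half-cycle) and the bounds lr_min, lr_max as illustrative choices; dividing (lr_max − lr_min) by 2**cycle gives the halved-range variant:

def triangular_lr(iteration, step_size, lr_min, lr_max):
    """Triangular cyclical learning rate: linear ramp up and down within each cycle."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # 1 at the cycle borders, 0 at the peak
    return lr_min + (lr_max - lr_min) * (1 - x)
    # halved-range variant: lr_min + (lr_max - lr_min) * (1 - x) / 2 ** cycle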

Deep Learning – 21 / 47
CYCLICAL LEARNING RATES
Yet another option is to abruptly "restart" the learning rate after a
fixed number of iterations.
Loshchilov et al. (2016) proposed "cosine annealing" (between
restarts).
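A sketch of cosine annealing with restarts for a fixed cycle length (SGDR also allows the period to grow between restarts, which is omitted here):

import math

def cosine_annealing_lr(t, period, lr_min, lr_max):
    """Cosine decay within each cycle; the rate jumps back to lr_max at every restart."""
    t_cur = t % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / period))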

Credit: Hafidz Zulkifli

Deep Learning – 22 / 47
Algorithms with Adaptive Learning Rates

Deep Learning – 23 / 47
ADAPTIVE LEARNING RATES
The learning rate is reliably one of the hyperparameters that is the most difficult to set because it has a significant impact on the model's performance.
Naturally, it might make sense to use a different learning rate for
each parameter, and automatically adapt them throughout the
training process.

Deep Learning – 24 / 47
ADAGRAD
Adagrad adapts the learning rate to the parameters.
In fact, Adagrad scales learning rates inversely proportional to the
square root of the sum of the past squared derivatives.
Parameters with large partial derivatives of the loss obtain a
rapid decrease in their learning rate.
Parameters with small partial derivatives on the other hand
obtain a relatively small decrease in their learning rate.
For that reason, Adagrad might be well suited when dealing with
sparse data.
Goodfellow et al. (2016) note that the accumulation of squared gradients can result in a premature and overly large decrease in the learning rate.

Deep Learning – 25 / 47
ADAGRAD
Algorithm Adagrad
1: require Global learning rate α
2: require Initial parameter θ
3: require Small constant β , perhaps 10−7 , for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
7: Compute gradient estimate: ĝ ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
8: Accumulate squared gradient r ← r + ĝ ⊙ ĝ
9: Compute update: ∇θ = − (α / (β + √r)) ⊙ ĝ (division and square root applied element-wise)
10: Apply update: θ ← θ + ∇θ
11: end while

"⊙" is called the Hadamard or element-wise product.

Example:

A = [ 1 2 ; 3 4 ],  B = [ 5 6 ; 7 8 ],  then  A ⊙ B = [ 1·5  2·6 ; 3·7  4·8 ] = [ 5  12 ; 21  32 ]
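A minimal NumPy sketch of one Adagrad update on a parameter vector; element-wise multiplication and np.sqrt implement ⊙ and the element-wise square root:

import numpy as np

def adagrad_step(theta, r, g_hat, alpha=0.01, beta=1e-7):
    """One Adagrad update; r accumulates the squared gradients over all past steps."""
    r = r + g_hat * g_hat                           # r <- r + g ⊙ g
    delta = -(alpha / (beta + np.sqrt(r))) * g_hat  # per-parameter scaled step
    return theta + delta, r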

Deep Learning – 26 / 47
RMSPROP
RMSprop is a modification of Adagrad.
Its intention is to resolve Adagrad’s radically diminishing learning rates.
The gradient accumulation is replaced by an exponentially
weighted moving average.
Theoretically, that leads to performance gains in non-convex
scenarios.
Empirically, RMSProp is a very effective optimization algorithm.
Particularly, it is employed routinely by deep learning practitioners.

Deep Learning – 27 / 47
RMSPROP
Algorithm RMSProp
1: require Global learning rate α and decay rate ρ ∈ [0, 1)
2: require Initial parameter θ
3: require Small constant β , perhaps 10−6 , for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6: Sample a minibatch of m examples from the training set {x̃(1), . . . , x̃(m)}
7: Compute gradient estimate: ĝ ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
8: Accumulate squared gradient: r ← ρr + (1 − ρ) ĝ ⊙ ĝ
9: Compute update: ∇θ = − (α / (β + √r)) ⊙ ĝ
10: Apply update: θ ← θ + ∇θ
11: end while

Deep Learning – 28 / 47
ADAM
Adaptive Moment Estimation (Adam) is another method that
computes adaptive learning rates for each parameter.
Adam uses the first and the second moments of the gradients.
Adam keeps an exponentially decaying average of past
gradients (first moment).
Like RMSProp it stores an exponentially decaying average of
past squared gradients (second moment).
Thus, it can be seen as a combination of RMSProp and
momentum.
Basically Adam uses the combined averages of previous gradients
at different moments to give it more “persuasive power” to
adaptively update the parameters.

Deep Learning – 29 / 47
ADAM
Algorithm Adam
1: require Step size α (suggested default: 0.001)
2: require Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested de-
faults: 0.9 and 0.999 respectively)
3: require Small constant β (suggested default 10−8 )
4: require Initial parameters θ
5: Initialize time step t = 0
6: Initialize 1st and 2nd moment variables s[0] = 0, r[0] = 0
7: while stopping criterion not met do
8: t ←t +1
9: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
10: Compute gradient estimate: ĝ[t] ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
11: Update biased first moment estimate: s[t] ← ρ1 s[t−1] + (1 − ρ1) ĝ[t]
12: Update biased second moment estimate: r[t] ← ρ2 r[t−1] + (1 − ρ2) ĝ[t] ⊙ ĝ[t]
13: Correct bias in first moment: ŝ ← s[t] / (1 − ρ1^t)
14: Correct bias in second moment: r̂ ← r[t] / (1 − ρ2^t)
15: Compute update: ∇θ = −α ŝ / (√r̂ + β)
16: Apply update: θ ← θ + ∇θ
17: end while
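A minimal NumPy sketch of one Adam step on a parameter vector (t is assumed to start at 1 so that the bias correction is well defined):

import numpy as np

def adam_step(theta, s, r, t, g_hat, alpha=0.001, rho1=0.9, rho2=0.999, beta=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    s = rho1 * s + (1 - rho1) * g_hat             # biased first moment
    r = rho2 * r + (1 - rho2) * g_hat * g_hat     # biased second moment
    s_hat = s / (1 - rho1 ** t)                   # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + beta)
    return theta, s, r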

Deep Learning – 30 / 47
ADAM
Adam initializes the exponentially weighted moving averages s
and r as 0 (zero) vectors.
As a result, they are biased towards zero.
This means E[s[t ] ] ̸= E[ĝ[t ] ] and E[r[t ] ] ̸= E[ĝ[t ] ⊙ ĝ[t ] ] (where the
expectations are calculated over minibatches).
To see this, let us unroll the computation of s[t ] for a few
time-steps:
s[0] = 0
s[1] = ρ1 s[0] + (1 − ρ1) ĝ[1] = (1 − ρ1) ĝ[1]
s[2] = ρ1 s[1] + (1 − ρ1) ĝ[2] = ρ1 (1 − ρ1) ĝ[1] + (1 − ρ1) ĝ[2]
s[3] = ρ1 s[2] + (1 − ρ1) ĝ[3] = ρ1² (1 − ρ1) ĝ[1] + ρ1 (1 − ρ1) ĝ[2] + (1 − ρ1) ĝ[3]

Therefore, s[t] = (1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) ĝ[i].
Note that the contribution of the earlier ĝ[i ] to the moving average
shrinks rapidly.

Deep Learning – 31 / 47
ADAM
The expected value of s[t] is:

E[s[t]] = E[(1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) ĝ[i]]
        = E[ĝ[t]] (1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) + ζ
        = E[ĝ[t]] (1 − ρ1^t) + ζ

where we approximate ĝ[i] with ĝ[t], which allows us to move it outside the sum. ζ is the error that results from this approximation.
Therefore, s[t] is a biased estimator of ĝ[t] and the effect of the bias vanishes over the time-steps (because ρ1^t → 0 for t → ∞).
Ignoring ζ (as it can be kept small), we correct for the bias by setting ŝ[t] = s[t] / (1 − ρ1^t).
Similarly, we set r̂[t] = r[t] / (1 − ρ2^t).

Deep Learning – 32 / 47
COMPARISON OF OPTIMIZERS: ANIMATION

Credits: Dettmers (2015) and Radford

Figure: Excerpts from an animation comparing the behavior of momentum and other methods to SGD near a saddle point. Left: after a few seconds; right: a bit later. The animation shows that all shown methods accelerate optimization compared to standard SGD. The highest acceleration is obtained using RMSProp, followed by Adagrad, as learning rate strategies. You can find the animation here or click on the images above.

Deep Learning – 33 / 47
Batch Normalization

Deep Learning – 34 / 47
BATCH NORMALIZATION
Batch Normalization (BatchNorm) is an extremely popular
technique that improves the training speed and stability of deep
neural nets.
It is an extra component that can be placed between each layer of
the neural network.
It works by changing the "distribution" of activations at each hidden
layer of the network.
We know that it is sometimes beneficial to normalize the inputs to
a learning algorithm by shifting and scaling all the features so that
they have 0 mean and unit variance.
BatchNorm applies a similar transformation to the activations of
the hidden layers (with a couple of additional tricks).

Deep Learning – 35 / 47
BATCH NORMALIZATION
For a hidden layer with neurons zj , j = 1, . . . , J, BatchNorm is
applied to each zj by considering the activations of zj over a given
minibatch of inputs.
Let zj⁽ⁱ⁾ denote the activation of zj for input x⁽ⁱ⁾ in the minibatch (of size m).
The mean and variance of the activations are

µj = (1/m) Σᵢ zj⁽ⁱ⁾
σj² = (1/m) Σᵢ (zj⁽ⁱ⁾ − µj)²

Each zj⁽ⁱ⁾ is then normalized:

z̃j⁽ⁱ⁾ = (zj⁽ⁱ⁾ − µj) / √(σj² + ϵ)

where a small constant, ϵ, is added for numerical stability.


Deep Learning – 36 / 47
BATCH NORMALIZATION
It may not be desirable to normalize the activations in such a rigid
way because potentially useful information can be lost in the
process.
Therefore, we commonly let the training algorithm decide the "right amount" of normalization by allowing it to re-shift and re-scale z̃j⁽ⁱ⁾ to arrive at the batch-normalized activation ẑj⁽ⁱ⁾:

ẑj⁽ⁱ⁾ = γj z̃j⁽ⁱ⁾ + βj

γj and βj are learnable parameters that are also tweaked by backpropagation.
ẑj⁽ⁱ⁾ then becomes the input to the next layer.
Note: The algorithm is free to scale and shift each z̃j⁽ⁱ⁾ back to its original (unnormalized) value.
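A minimal NumPy sketch of this forward computation for one layer during training; Z is assumed to hold the m × J activations of the minibatch, and gamma and beta are the learnable vectors:

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each neuron over the minibatch, then re-scale and re-shift."""
    mu = Z.mean(axis=0)                      # per-neuron batch mean
    var = Z.var(axis=0)                      # per-neuron batch variance
    Z_tilde = (Z - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * Z_tilde + beta            # batch-normalized activations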

Deep Learning – 37 / 47
BATCH NORMALIZATION: ILLUSTRATION
Recall: zj = σ(Wjᵀ x + bj)
So far, we have applied BatchNorm to the activation zj. It is possible (and more common) to apply BatchNorm to Wjᵀ x + bj before passing it to the nonlinear activation σ.

Figure: FC = Fully Connected layer. BatchNorm is applied before the nonlinear activation function.

Deep Learning – 38 / 47
BATCH NORMALIZATION
The key impact of BatchNorm on the training process is this: It
reparametrizes the underlying optimization problem to make its
landscape significantly more smooth.
One aspect of this is that the loss changes at a smaller rate and
the magnitudes of the gradients are also smaller (see Santurkar et
al. 2018).

Deep Learning – 39 / 47
BATCH NORMALIZATION: PREDICTION
Once the network has been trained, how can we generate a
prediction for a single input (either at test time or in production)?
One option is to feed the entire training set to the (trained) network
and compute the means and standard deviations.
More commonly, during training, an exponentially weighted
running average of each of these statistics over the minibatches is
maintained.
The learned γ and β parameters are then used (in conjunction
with the running averages) to generate the output.
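A sketch of the running-average approach; the momentum value 0.9 is an illustrative choice:

import numpy as np

def update_running_stats(run_mu, run_var, mu, var, momentum=0.9):
    """Exponentially weighted running averages of the minibatch statistics (kept during training)."""
    return (momentum * run_mu + (1 - momentum) * mu,
            momentum * run_var + (1 - momentum) * var)

def batchnorm_predict(z, run_mu, run_var, gamma, beta, eps=1e-5):
    """At test time, normalize an input with the stored running statistics."""
    return gamma * (z - run_mu) / np.sqrt(run_var + eps) + beta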

Deep Learning – 40 / 47
BATCH NORMALIZATION
For our final benchmark in this chapter we train two models on the MNIST data.
One will extend our basic architecture such that we add batch
normalization to all hidden layers.
We use SGD as optimizer with a momentum of 0.9, a learning rate
of 0.03 and weight decay of 0.001.

Deep Learning – 41 / 47
BATCH NORMALIZATION

Batch Normalization accelerated learning.

Deep Learning – 42 / 47
BATCH NORMALIZATION

Batch Normalization resulted in a lower test error.

Deep Learning – 43 / 47
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Tim Dettmers (2015)
Deep Learning in a Nutshell: History and Training
https://devblogs.nvidia.com/deep-learning-nutshell-history-training/

Deep Learning – 44 / 47
REFERENCES
Hafidz Zulkifli (2018)
Understanding Learning Rates and How It Improves Performance in Deep Learning
https://towardsdatascience.com
Ilya Loshchilov, Frank Hutter (2016)
SGDR: Stochastic Gradient Descent with Warm Restarts
https://arxiv.org/abs/1608.03983
Jeremy Jordan (2018)
Setting the learning rate of your neural network
https://www.jeremyjordan.me/nn-learning-rate/
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry (2018)
How Does Batch Normalization Help Optimization?
https://arxiv.org/abs/1805.11604

Deep Learning – 45 / 47
REFERENCES
Akshay Chandra (2015)
Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent
https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12

Deep Learning – 46 / 47
Deep Learning

Modern Activation Functions

Learning goals
Challenges in Optimization
related to Activation Functions
Activations for Hidden Units
Activations for Output Units
Hidden activations

Deep Learning – 1 / 17
HIDDEN ACTIVATIONS
Recall, hidden-layer activation functions make it possible for deep
neural nets to learn complex non-linear functions.
The design of hidden units is an extremely active area of research.
It is usually not possible to predict in advance which activation will
work best. Therefore, the design process often consists of trial and
error.
In the following, we will limit ourselves to the most popular
activations - Sigmoidal activation and ReLU.
It is possible for many other functions to perform as well as these
standard ones. An overview of further activations can be found
here .

Deep Learning – 2 / 17
SIGMOIDAL ACTIVATIONS
Sigmoidal functions such as tanh and the logistic sigmoid bound
the outputs to a certain range by "squashing" their inputs.

In each case, the function is only sensitive to its inputs in a small neighborhood around 0.
Furthermore, the derivative is never greater than 1 and is close to
zero across much of the domain.

Deep Learning – 3 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
1 Saturating Neurons:
We know: σ′(z_in) → 0 for |z_in| → ∞.
→ Neurons with sigmoidal activations "saturate" easily, that is, they stop being responsive when |z_in| ≫ 0.

Deep Learning – 4 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
2 Vanishing Gradients: Consider the vector of error signals δ⁽ⁱ⁾ in layer i:

δ⁽ⁱ⁾ = W⁽ⁱ⁺¹⁾ δ⁽ⁱ⁺¹⁾ ⊙ σ′(z_in⁽ⁱ⁾),  i ∈ {1, ..., O}.

The k-th component of the vector expresses how much the loss L changes when the input to the k-th neuron z_{k,in}⁽ⁱ⁾ changes.
We know: σ′(z) < 1 for all z ∈ ℝ.
→ In each step of the recursive formula above, the value will be multiplied by a value smaller than one:

δ⁽¹⁾ = W⁽²⁾ δ⁽²⁾ ⊙ σ′(z_in⁽¹⁾)
     = W⁽²⁾ (W⁽³⁾ δ⁽³⁾ ⊙ σ′(z_in⁽²⁾)) ⊙ σ′(z_in⁽¹⁾)
     = ...

When this occurs, earlier layers train very slowly (or not at all).
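A tiny numeric illustration of the effect: the derivative of the logistic sigmoid is at most 0.25, so the product of many such factors shrinks towards zero (the 20 random pre-activations are an arbitrary illustrative choice).

import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25

rng = np.random.default_rng(0)
z = rng.normal(size=20)           # pre-activations of 20 stacked layers
print(np.prod(sigmoid_deriv(z)))  # product of derivative factors, practically zero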

Deep Learning – 5 / 17
RECTIFIED LINEAR UNITS (RELU)

The ReLU activation solves the vanishing gradient problem.

In regions where the activation is positive, the derivative is 1.


As a result, the derivatives do not vanish along paths that contain
such "active" neurons even if the network is deep.
Note that the ReLU is not differentiable at 0 (Software
implementations return either 0 or 1 for the derivative at this point).

Deep Learning – 6 / 17
RECTIFIED LINEAR UNITS (RELU)
ReLU units can significantly speed up training compared to units
with saturating activations.

Source : Krizhevsky et al. (2012)

Figure: A four-layer convolutional neural network with ReLUs (solid line) reaches a
25% training error rate on the CIFAR-10 dataset six times faster than an equivalent
network with tanh neurons (dashed line).

Deep Learning – 7 / 17
RECTIFIED LINEAR UNITS (RELU)
A downside of ReLU units is that when the input to the activation is
negative, the derivative is zero. This is known as the "dying ReLU
problem".

When a ReLU unit "dies", that is, when its activation is 0 for all
datapoints, it kills the gradient flowing through it during
backpropagation.
This means such units are never updated during training and the
problem can be irreversible.
Deep Learning – 8 / 17
GENERALIZATIONS OF RELU
There exist several generalizations of the ReLU activation that
have non-zero derivatives throughout their domains.
Leaky ReLU:

LReLU(v) = v    if v ≥ 0
LReLU(v) = αv   if v < 0

Unlike the ReLU, when the input to the Leaky ReLU activation is
negative, the derivative is α which is a small positive value (such
as 0.01).
Deep Learning – 9 / 17
GENERALIZATIONS OF RELU
A variant of the Leaky ReLU is the Parametric ReLU (PReLU)
which learns the α from the data through backpropagation.
Exponential Linear Unit (ELU):

ELU(v) = v            if v ≥ 0
ELU(v) = α(eᵛ − 1)    if v < 0

Scaled Exponential Linear Unit (SELU):

SELU(v) = λ v             if v ≥ 0
SELU(v) = λ α(eᵛ − 1)     if v < 0
Note: In ELU and SELU, α and λ are hyperparameters that are set before training.

These generalizations may perform as well as or better than the ReLU on some tasks.
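A minimal NumPy sketch of these activations; for SELU, α ≈ 1.6733 and λ ≈ 1.0507 are the constants commonly used for self-normalization:

import numpy as np

def leaky_relu(v, alpha=0.01):
    return np.where(v >= 0, v, alpha * v)

def elu(v, alpha=1.0):
    return np.where(v >= 0, v, alpha * (np.exp(v) - 1))

def selu(v, alpha=1.6733, lam=1.0507):
    return lam * np.where(v >= 0, v, alpha * (np.exp(v) - 1))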

Deep Learning – 10 / 17
GENERALIZATIONS OF RELU

Source : Zhang et al. 2018

Deep Learning – 11 / 17
Output activations

Deep Learning – 12 / 17
OUTPUT ACTIVATIONS
As we have seen previously, the role of the output activation is to
get the final score on the same scale as the target.
The output activations and the loss functions used to train neural
networks can be viewed through the lens of maximum likelihood
estimation (MLE).
In general, the function f (x | θ) represented by the neural network
defines the conditional p(y | x, θ) in a supervised learning task.
Maximizing the likelihood is then equivalent to minimizing
− log p(y | x, θ).
An output unit with the identity function as the activation can be
used to represent the mean of a Gaussian distribution.
For such a unit, training with mean-squared error is equivalent to
maximizing the log-likelihood (ignoring issues with non-convexity).

Deep Learning – 13 / 17
OUTPUT ACTIVATIONS
Similarly, sigmoid and softmax units can output the parameter(s) of
a Bernoulli distribution and Categorical distribution, respectively.
It is straightforward to show that when the label is one-hot
encoded, training with the cross-entropy loss is equivalent to
maximizing log-likelihood. Click here
Because these activations can saturate, an important advantage of
maximizing log-likelihood is that the log undoes some of the
exponentiation in the activation functions which is desirable when
optimizing with gradient-based methods.
For example, in the case of softmax, the loss is:

L(y, f(x)) = −f_in,k + log Σ_{k′=1}^{g} exp(f_in,k′)

where k is the correct class. The first term, −fin,k , does not
saturate which means training can progress steadily even if the
contribution of fin,k to the second term is negligible.
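A small sketch of this loss for a single score vector f_in and correct class index k, using the usual max-shift so that the log-sum-exp does not overflow:

import numpy as np

def softmax_cross_entropy(f_in, k):
    """Loss -f_in[k] + log(sum_j exp(f_in[j])) for the correct class k."""
    shifted = f_in - np.max(f_in)   # subtracting the max does not change the loss value
    return -shifted[k] + np.log(np.sum(np.exp(shifted)))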
Deep Learning – 14 / 17
OUTPUT ACTIVATIONS
A neural network can even be used to output the parameters of
more complex distributions.
A Mixture Density Network, for example, outputs the parameters of
a Gaussian Mixture Model:
p(y | x) = Σ_{c=1}^{m} φ⁽ᶜ⁾(x) N(y; µ⁽ᶜ⁾(x), Σ⁽ᶜ⁾(x))

where m is the number of components in the mixture.
In such a network, the output units are divided into groups.
One group of output neurons with softmax activation represents
the weights (φ(c ) ) of the mixture.
Another group with the identity activation represents the means
(µ(c ) ) and yet another group with a non-negative activation
function (such as ReLU or the exponential function) can represent
the variances of the (typically) diagonal covariance matrices Σ(c ) .
Deep Learning – 15 / 17
OUTPUT ACTIVATIONS

Source : Goodfellow et al. (2016)

Figure: Samples drawn from a Mixture Density Network. The input x is sampled from
a uniform distribution and y is sampled from p(y | x, θ).

Deep Learning – 16 / 17
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton (2012)
ImageNet Classification with Deep Convolutional Neural Networks
https: // papers. nips. cc/ paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks
pdf
Guoqiang Zhang and Haopeng Li (2018)
Effectiveness of Scaled Exponentially-Regularized Linear Units
https: // arxiv. org/ abs/ 1807. 10117

Deep Learning – 17 / 17
Deep Learning

Network Initializations

Learning goals
Why Initialization matters
Weight Initializations
Bias Initialization
PRACTICAL INITIALIZATION
The weights (and biases) of a neural network must be assigned
some initial values before training can begin.
The choice of the initial weights (and biases) is crucial as it
determines whether an optimization algorithm converges, how fast
and whether to a point with high or low risk.
Initialization strategies to achieve "nice" properties are difficult to
find, because there is no good understanding which properties are
preserved under which circumstances.
In the following we distinguish between the initialization of weights and biases.

Deep Learning – 1 / 10
WEIGHT INITIALIZATION
It is important to initialize the weights randomly in order to "break
symmetry". If two neurons (with the same activation function in a
fully connected network) are connected to the same inputs and
have the same initial weights, then both neurons will have the
same gradient update in a given iteration and they will end up
learning the same features.
Furthermore, the initial weights should not be too large, because
this might result in an explosion of weights or high sensitivity to
changes in the input.
Weights are typically drawn from a uniform distribution or a
Gaussian centered at 0 with a small variance.
Centering the initial weights around 0 can be seen as a form of regularization: it imposes a prior that units are more likely not to interact with each other than to interact.

Deep Learning – 2 / 10
WEIGHT INITIALIZATION
Two common initialization strategies for weights are the ’Glorot
initialization’ and ’He initialization’ which tune the variance of these
distributions based on the topology of the network.
Glorot initialization suggests sampling each weight of a fully connected layer with m inputs and n outputs from a uniform distribution

w_{j,k} ∼ U( −√(6 / (m + n)), √(6 / (m + n)) )

The strategy is derived from the assumption that the network consists only of a chain of matrix multiplications with no nonlinearities.

Deep Learning – 3 / 10
WEIGHT INITIALIZATION
He initialization is especially useful for neural networks with ReLU activations. Each weight of a fully connected layer with m inputs is sampled from a Gaussian distribution

w_{j,k} ∼ N(0, 2/m)

The underlying derivation can be found in He et al. (2015).
Since the initialization strategies of Glorot and He depend on the
layer sizes, the initial weights for large layer sizes can become
extremely small.
Another strategy is to treat the weights as hyperparameters that
can be optimized by hyperparameter search algorithms. This can
be computationally costly.
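A minimal NumPy sketch of the two sampling schemes for a fully connected layer with m inputs and n outputs:

import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(m, n):
    """Glorot initialization: uniform with limit sqrt(6 / (m + n))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def he_normal(m, n):
    """He initialization: Gaussian with variance 2 / m (suited to ReLU layers)."""
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / m), size=(m, n))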

Deep Learning – 4 / 10
WEIGHT INITIALIZATION: EXAMPLE
We use a spiral planar data set to compare the following strategies:
Zero initialization, random initialization (samples from N(0, 1 · 10⁻⁴)) and He initialization.
For each strategy, a neural network with one hidden layer with 100
units, ReLU activation and Gradient Descent as optimizer was
used.

Credit : Ghatak, ch. 4

Figure: Simulated spiral planar data set with two classes.

Deep Learning – 5 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with zero initialization on the training data set (left)
and the testing data set (right). The zero initialization does not break symmetry
and the complexity of the network reduces to that of a single neuron.

Deep Learning – 6 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with random initialization (N(0, 1 · 10⁻⁴)) on the training data set (left) and the testing data set (right).

Deep Learning – 7 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with He initialization on the training data set (left)
and the testing data set (right).

Deep Learning – 8 / 10
BIAS INITIALIZATION
Typically, we set the biases for each unit to heuristically chosen
constants.
Setting the biases to zero is compatible with most weight
initialization schemes as the schemes expect a small bias.
However, deviations from 0 can be made individually, for example,
in order to obtain the right marginal statistics of the output unit or
to avoid causing too much saturation at the initialization.
For details see Goodfellow et al. (2016).

Deep Learning – 9 / 10
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 1026-1034.
https://arxiv.org/abs/1502.01852
Xavier Glorot and Yoshua Bengio (2010)
Understanding the difficulty of training deep feedforward neural networks
AISTATS, Volume 9 of JMLR Proceedings, pages 249-256. JMLR.org
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
Abhijit Ghatak (2019)
Deep Learning with R. Springer.

Deep Learning – 10 / 10
Deep Learning

CNN: Introduction
Learning goals
What are CNNs?
When to apply CNNs?
A glimpse into CNN architectures
CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNN, or ConvNet) are a powerful family of neural networks that are inspired by biological processes in which the connectivity pattern between neurons resembles the organization of the mammalian visual cortex.

Figure: The ventral (recognition) pathway in the visual cortex has multiple stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT etc., which consist of lots of intermediate representations.

Deep Learning – 1 / 11
CONVOLUTIONAL NEURAL NETWORKS
Since 2012, given their success in the ILSVRC competition, CNNs
are popular in many fields.
Common applications of CNN-based architectures in computer
vision are:
Image classification.
Object detection / localization.
Semantic segmentation.
CNNs are widely applied in other domains such as natural
language processing (NLP), audio, and time-series data.
Basic idea: a CNN automatically extracts visual, or, more generally, spatial features from the input data such that it is able to make the optimal prediction based on the extracted features.
It contains different building blocks and components.

Deep Learning – 2 / 11
CNNS - WHAT FOR?

Figure: All Tesla cars being produced now have full self-driving hardware
(Source: Tesla website). A convolutional neural network is used to map raw
pixels from a single front-facing camera directly into steering commands. The
system learns to drive in traffic, on local roads, with or without lane markings
as well as on highways.

Deep Learning – 3 / 11
CNNS - WHAT FOR?

Figure: Given an input image, a CNN is first used to get the feature map of the
last convolutional layer, then a pyramid parsing module is applied to harvest
different sub-region representations, followed by upsampling and
concatenation layers to form the final feature representation, which carries
both local and global context information. Finally, the representation is fed into
a convolution layer to get the final per-pixel prediction. (Source: pyramid scene
parsing network, by Zhao et. al, CVPR 2017)

Deep Learning – 4 / 11
CNNS - WHAT FOR?

Figure: Road segmentation (Mnih Volodymyr (2013)). Aerial images and


possibly outdated map pixels are segmented.

Deep Learning – 5 / 11
CNNS - WHAT FOR?
CNN for personalized medicine. Examples:
Tracking, diagnosis and localization of Covid-19 patients.
A CNN-based method (RADLogists) for personalized Covid-19 detection: three CT scans from a single coronavirus patient diagnosed by RADLogists.

Deep Learning – 6 / 11
CNNS - WHAT FOR?

Figure: Four COVID-19 lung CT scans at the top with corresponding colored
maps showing Corona virus abnormalities at the bottom (Source: Megan
Scudellari, IEEE Spectrum 2021).

Deep Learning – 7 / 11
CNNS - WHAT FOR?

Figure: Various analyses in computational pathology are possible. For example, nuclear segmentation in digital microscopic tissue images enables extraction of high-quality features for nuclear morphometrics (Source: Kummar et al., IEEE Transactions on Medical Imaging).

Deep Learning – 8 / 11
CNNS - WHAT FOR?

Figure: Image Colorization is another interesting application of CNN in


computer vision (Zhang et al. (2016)). Given a grayscale photo as the input
(top row), this network solves the problem of hallucinating a plausible color
version of the photo (bottom row, i.e. the prediction of the network).

Deep Learning – 9 / 11
CNNS - WHAT FOR?

Figure: Speech recognition (Anand & Verma (2015)). Convolutional neural


network is used to learn features from the audio data in order to classify
emotions.

Deep Learning – 10 / 11
CNNS - A FIRST GLIMPSE

Input layer takes input data (e.g. image, audio).


Convolution layers extract feature maps from the previous layers.
Pooling layers reduce the dimensionality of feature maps and
filter meaningful features.

Deep Learning – 11 / 11
CNNS - A FIRST GLIMPSE

Fully connected layers connect feature map elements to the


output neurons.
Softmax converts output values to probability scores.

Deep Learning – 11 / 11
Deep Learning

Convolutional Operation

Learning goals
What are filters?
Convolutional Operation
2D Convolution
FILTERS TO EXTRACT FEATURES
Filters have been widely applied in Computer Vision (CV) since the ’70s.
One prominent example: Sobel-Filter.
It detects edges in images.

Figure: Sobel-filtered image.

Deep Learning – 1 / 9
FILTERS TO EXTRACT FEATURES
Edges occur where the intensity over neighboring pixels changes
fast.
Thus, we approximate the gradient of the intensity at each pixel.
Sobel showed that the gradient image Gx of original image A in
x-dimension can be approximated by:

     [ −1  0  +1 ]
Gx = [ −2  0  +2 ] ∗ A = Sx ∗ A
     [ −1  0  +1 ]
where ∗ indicates a mathematical operation known as a
convolution, not a traditional matrix multiplication.
The filter matrix Sx consists of the product of an averaging and a
differentiation kernel:

Sx = [1 2 1]ᵀ · [−1 0 +1]
     (averaging)  (differentiation)

Deep Learning – 2 / 9
FILTERS TO EXTRACT FEATURES
Similarly, the gradient image Gy in y-dimension can be
approximated by:

     [ −1  −2  −1 ]
Gy = [  0   0   0 ] ∗ A = Sy ∗ A
     [ +1  +2  +1 ]

The combination of both gradient images yields a dimension-independent gradient information G:

G = √(Gx² + Gy²)

These matrix operations were used to create the filtered picture of Albert Einstein.
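A small sketch of the operation, assuming SciPy is available; since we only use the gradient magnitude, the kernel flip performed by a true convolution does not matter here:

import numpy as np
from scipy.signal import convolve2d

Sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
Sy = Sx.T

def sobel_magnitude(A):
    """Approximate gradient magnitude of a grayscale image A."""
    Gx = convolve2d(A, Sx, mode="same")
    Gy = convolve2d(A, Sy, mode="same")
    return np.sqrt(Gx ** 2 + Gy ** 2)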

Deep Learning – 3 / 9
HORIZONTAL VS VERTICAL EDGES

Source: Wikipedia

Figure: Sobel filtered images. Outputs are normalized in each case.

Deep Learning – 4 / 9
FILTERS TO EXTRACT FEATURES

Let’s do this on a dummy image.


How to represent a digital image?

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Basically as an array of integers.

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Sx enables us to detect vertical edges!

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

(Gx)_(i,j) = (I ∗ Sx)_(i,j) = −1·0 + 0·255 + 1·255
                              −2·0 + 0·0   + 2·255
                              −1·0 + 0·255 + 1·255
                            = 1020

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Applying the Sobel-Operator to every location in the input yields


the feature map.

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Normalized feature map reveals vertical edges.


Note the dimensional reduction compared to the dummy image.

Deep Learning – 5 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?
What we just did was extracting pre-defined features from our
input (i.e. edges).
A convolutional neural network does almost exactly the same:
“extracting features from the input”.
⇒ The main difference is that we usually do not tell the CNN what to look for (pre-define the filters); the CNN decides itself.
In a nutshell:
We initialize a lot of random filters (like the Sobel but just
random entries) and apply them to our input.
Then, a classifier (e.g. a feed-forward neural net) uses the resulting feature maps as input data.
Filter entries will be adjusted by common gradient descent
methods.

Deep Learning – 6 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?

Deep Learning – 7 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?

Deep Learning – 7 / 9
WORKING WITH IMAGES
In order to understand the functionality of CNNs, we have to
familiarize ourselves with some properties of images.
Grey scale images:
Matrix with dimensions height × width × 1.
Pixel entries differ from 0 (black) to 255 (white).
Color images:
Tensor with dimensions height × width × 3.
The depth 3 denotes the RGB values (red - green - blue).
Filters:
A filter’s depth is always equal to the input’s depth!
In practice, filters are usually square.
Thus we only need one integer to define its size.
For example, a filter of size 2 applied on a color image
actually has the dimensions 2 × 2 × 3.

Deep Learning – 8 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

To obtain s11 we simply compute the dot product:


s11 = a · w11 + b · w12 + d · w21 + e · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Same for s12 :


s12 = b · w11 + c · w12 + e · w21 + f · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

As well as for s21 :


s21 = d · w11 + e · w12 + g · w21 + h · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

And finally for s22 :


s22 = e · w11 + f · w12 + h · w21 + i · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

s11 = a · w11 + b · w12 + d · w21 + e · w22


s12 = b · w11 + c · w12 + e · w21 + f · w22
s21 = d · w11 + e · w12 + g · w21 + h · w22
s22 = e · w11 + f · w12 + h · w21 + i · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

More generally, let I be the matrix representing the input and W be the filter/kernel. Then the entries of the output matrix are defined by

s_ij = Σ_{m,n} I_{i+m−1, j+n−1} · w_{mn}

where the sum runs over the row index m and the column index n of the kernel W.
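A naive NumPy sketch of this operation for a single-channel input with a "valid" output size (this is the cross-correlation that CNN frameworks actually compute):

import numpy as np

def conv2d(I, W):
    """s_ij = sum_{m,n} I[i+m, j+n] * W[m, n] (0-based), no padding, stride 1."""
    k1, k2 = W.shape
    h_out, w_out = I.shape[0] - k1 + 1, I.shape[1] - k2 + 1
    S = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            S[i, j] = np.sum(I[i:i + k1, j:j + k2] * W)  # dot product at this spatial location
    return S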

Deep Learning – 9 / 9
Deep Learning

Properties of Convolution

Learning goals
Sparse Interactions
Parameter Sharing
Equivariance to Translation
SPARSE INTERACTIONS

We want to use the “neuron-wise” representation of our CNN.


Moving the filter to the first spatial location yields the first entry of
the feature map which is composed of these four connections.
Deep Learning – 1 / 7
SPARSE INTERACTIONS

Similarly...

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Similarly...

Deep Learning – 1 / 7
SPARSE INTERACTIONS

and finally s22 by these and in total, we obtain 16 connections!

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Assume we would replicate the architecture with a dense net.

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Each input neuron is connected with each hidden layer neuron.

Deep Learning – 1 / 7
SPARSE INTERACTIONS

In total, we obtain 36 connections!

Deep Learning – 1 / 7
SPARSE INTERACTIONS
What does that mean?
Our CNN has a receptive field of 4 neurons.
That means, we apply a “local search” for features.
A dense net on the other hand conducts a “global search”.
The receptive field of the dense net are 9 neurons.
When processing images, it is more likely that features occur at
specific locations in the input space.
For example, it is more likely to find the eyes of a human in a
certain area, like the face.
A CNN only incorporates the surrounding area of the filter into
its feature extraction process.
The dense architecture on the other hand assumes that every
single pixel entry has an influence on the eye, even pixels far
away or in the background.

Deep Learning – 2 / 7
PARAMETER SHARING

For the next property we focus on the filter entries.

Deep Learning – 3 / 7
PARAMETER SHARING

In particular, we consider weight w11

Deep Learning – 3 / 7
PARAMETER SHARING

As we move the filter to the first spatial location..

Deep Learning – 3 / 7
PARAMETER SHARING

...we observe the following connection for weight w11

Deep Learning – 3 / 7
PARAMETER SHARING

Moving to the next location...

Deep Learning – 3 / 7
PARAMETER SHARING

...highlights that we use the same weight more than once!

Deep Learning – 3 / 7
PARAMETER SHARING

Even three...

Deep Learning – 3 / 7
PARAMETER SHARING

And in total four times.

Deep Learning – 3 / 7
PARAMETER SHARING

All together, we have just used four weights.

Deep Learning – 3 / 7
PARAMETER SHARING

How many weights does a corresponding dense net use?

Deep Learning – 3 / 7
PARAMETER SHARING

9 · 4 = 36! That is 9 times more weights!

Deep Learning – 3 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Why is that good?
Less parameters drastically reduce memory requirements.
Faster runtime:
For m inputs and n outputs, a fully connected layer requires m · n parameters and has O(m · n) runtime.
A convolutional layer has limited connections k ≪ m, thus only k · n parameters and O(k · n) runtime.
Less parameters mean less overfitting and better generalization!

Deep Learning – 4 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Example: consider a color image with size 100 × 100.
Suppose we would like to create one single feature map with a "same padding" (i.e. the hidden layer is of the same size).
Choosing a filter with size 5 means that we have a total of 5 · 5 · 3 = 75 parameters (bias unconsidered).
A dense net with the same amount of "neurons" in the hidden layer results in

(100² · 3) · (100²) = 300,000,000
 (input)     (hidden layer)

parameters.
Note that this was just a fictitious example. In practice we normally
do not try to replicate CNN architectures with dense networks.

Deep Learning – 5 / 7
EQUIVARIANCE TO TRANSLATION

Think of a specific feature of interest, here highlighted in grey.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

Furthermore, assume we had a tuned filter looking for exactly that


feature.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

The filter does not care at what location the feature of interest is
located at.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

It is literally able to find it anywhere! That property is called


equivariance to translation.
Note: A function f (x ) is equivariant to a function g if f (g (x )) = g (f (x )).

Deep Learning – 6 / 7
NONLINEARITY IN FEATURE MAPS
As in dense nets, we use activation functions on all feature map
entries to introduce nonlinearity in the net.
Typically rectified linear units (ReLU) are used in CNNs:
They reduce the danger of saturating gradients compared to
sigmoid activations.
They can lead to sparse activations, as neuron outputs ≤ 0 are squashed to 0, which increases computational speed.
As seen in the last chapter, many variants of ReLU (Leaky ReLU,
ELU, PReLU, etc.) exist.

Deep Learning – 7 / 7
Deep Learning

CNN Components

Learning goals
Input Channel
Padding
Stride
Pooling
INPUT CHANNEL

Figure: source: Chaitanya Belwal’s Blog

An image consists of the smallest indivisible segments called


pixels and every pixel has a strength often known as the pixel
intensity.
A grayscale image has a single input channel and the value of each pixel represents the amount of light.
Note that a grayscale value can lie between 0 and 255, where a value of 0 corresponds to black and 255 to white.

Deep Learning – 1 / 14
Figure: Image source: Computer Vision Primer: How AI Sees An Image (Kishan Maladkar’s Blog)

A colored digital image usually comes with three color channels,


i.e. the Red-Green-Blue channels, popularly known as the RGB
values.
Each pixel can be represented by a vector of three numbers (each
ranging from 0 to 255) for the three primary color channels.

Deep Learning – 2 / 14
VALID PADDING
Suppose we have an input of size 5 × 5 and a filter of size 2 × 2.

Deep Learning – 3 / 14
VALID PADDING
The filter is only allowed to move inside of the input space.

Deep Learning – 3 / 14
VALID PADDING
That will inevitably reduce the output dimensions.

In general, for an input of size i (× i) and filter size k (× k), the size of the output feature map o (× o) is calculated by:

o = i − k + 1

Deep Learning – 3 / 14
SAME PADDING
Suppose the following situation: an input with dimensions 5 × 5 and a filter with size 3 × 3.

Deep Learning – 4 / 14
SAME PADDING
We would like to obtain an output with the same dimensions as the
input.

Deep Learning – 4 / 14
SAME PADDING
Hence, we apply a technique called zero padding. That is to say
“pad” zeros around the input:

Deep Learning – 4 / 14
SAME PADDING
That always works! We just have to adjust the zeros according to the input dimensions and filter size (i.e. one, two or more rows).

Deep Learning – 4 / 14
PADDING AND NETWORK DEPTH

Figure: “Valid” versus “same” convolution. Top : Without padding, the width of
the feature map shrinks rapidly to 1 after just three convolutional layers (filter
width of 6 shown in each layer). This limits how deep the network can be
made. Bottom : With zero padding (shown as solid circles), the feature map
can remain the same size after each convolution which means the network
can be made arbitrarily deep. (Goodfellow, et al., 2016, ch. 9)
Deep Learning – 5 / 14
Strides

Deep Learning – 6 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

Deep Learning – 7 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

Deep Learning – 7 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

In general, when there is no padding, for an input of size i, filter size k and stride s, the size o of the output feature map is:

o = ⌊(i − k) / s⌋ + 1

Deep Learning – 7 / 14
STRIDES AND DOWNSAMPLING

Figure: A strided convolution is equivalent to a convolution without strides


followed by downsampling (Goodfellow, et al., 2016, ch. 9).

Deep Learning – 8 / 14
MAX POOLING

We’ve seen how convolutions work, but there is one other


operation we need to understand.
We want to downsample the feature map but optimally lose no
information.

Deep Learning – 9 / 14
MAX POOLING

Applying the max pooling operation, we simply look for the


maximum value at each spatial location.
That is 8 for the first location.
Due to the filter of size 2, we halve the dimensions of the original feature map and obtain downsampling.

Deep Learning – 9 / 14
MAX POOLING

The final pooled feature map has entries 8, 6, 9 and 3.


Max pooling brings us 2 properties: 1) dimension reduction and 2) spatial invariance.
Popular pooling functions: max and (weighted) average.

Deep Learning – 9 / 14
AVERAGE POOLING

We've seen how max pooling works; there exist other pooling operations such as avg pooling, fractional pooling, LP pooling, softmax pooling, stochastic pooling, blur pooling, global average pooling, etc.
Similar to max pooling, we downsample the feature map but optimally lose no information.

Deep Learning – 10 / 14
AVERAGE POOLING

Applying the average pooling operation, we simply look for the


mean/average value at each spatial location.

Deep Learning – 10 / 14
AVERAGE POOLING

Average pooling uses all of the information (via the sum) and backpropagates to all responses.
It is not robust to noise.

Deep Learning – 10 / 14
AVERAGE POOLING

The final pooled feature map has entries 3.75, 2.5, 4.25 and 1.75.

Deep Learning – 10 / 14
COMPARISON OF MAX AND AVERAGE POOLING
Avg pooling uses all the information (via the sum), while max pooling uses only the highest value.
Max pooling removes details, so it is suitable for sparse information (image classification), while avg pooling is suitable for dense information (NLP).

Figure: Shortcomings of max and average pooling using Toy Image (photo
source: https://iq.opengenus.org/maxpool-vs-avgpool/)

Deep Learning – 11 / 14
Figure: CNNs use colored images where each of the Red, Green and Blue (RGB) color spectrums serve as input. (source:
Chaitanya Belwal’s Blog)

In this CNN:
there are 3 input channels, each a 4×4 input matrix,
one 2×2 filter (also known as a kernel),
a single ReLU layer,
a single pooling layer (which applies the MaxPool function),

Deep Learning – 12 / 14
and a single fully connected (FC) layer.
The elements of the filter matrix are equivalent to the unit weights
in a standard NN and will be updated during the backpropagation
phase.
Assuming a stride of 2 with no padding, the output size of the convolution layer is determined by the following equation:

O = (I − K + 2·P) / S + 1, where:
O: is the dimension (rows and columns) of the output square
matrix,
I: is the dimension (rows and columns) of the input square
matrix,
K: is the dimension (rows and columns) of the filter (kernel)
square matrix,
P: is the number of pixels(cells) of padding added to each
side of the input,

Deep Learning – 13 / 14
S: is the stride, or the number of cells skipped each time the
kernel is slided.

Inserting the values shown in the figure into the equation,

O = (I − K + 2·P) / S + 1 = (4 − 2 + 2·0) / 2 + 1 = 2
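The same computation as a small helper function (integer division implements the usual flooring):

def conv_output_size(i, k, p=0, s=1):
    """Output size O = (I - K + 2P) / S + 1 of a square convolution layer."""
    return (i - k + 2 * p) // s + 1

print(conv_output_size(i=4, k=2, p=0, s=2))   # 2, as in the example above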

Deep Learning – 14 / 14
Introduction to Deep Learning

CNN Applications

Learning goals
Application of CNNs in Visual
Recognition
APPLICATION - IMAGE CLASSIFICATION
One use case of CNNs is image classification.
There exists a broad variety of network architectures for image
classification such as the LeNet, AlexNet, InceptionNet and
ResNet which will be discussed later in detail.
All these architectures rely on a set of sequences of convolutional
layers and aim to learn the mapping from an image to a probability
score over a set of classes.

WS 2021/2022 Introduction to Deep Learning – 1 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: Image classification with Cifar-10: famous benchmark dataset with


60000 images and 10 classes (Alex Krizhevsky (2009)). There is also a much
more difficult version with 60000 images and 100 classes.

WS 2021/2022 Introduction to Deep Learning – 2 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: One example of the Cifar-10 data: A highly pixelated, coloured image
of a frog with dimension [3, 32, 32].

WS 2021/2022 Introduction to Deep Learning – 3 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: An example of a CNN architecture for classification on the Cifar-10


dataset (FC = Fully Connected).

WS 2021/2022 Introduction to Deep Learning – 4 / 18


CNN VS A FULLY CONNECTED NET ON CIFAR-10

Figure: Performance of a CNN and a fully connected neural net ("Dense") on


Cifar-10. Both networks have roughly the same number of layers. They were
trained using the same learning rate, weight decay and dropout rate. The CNN
performs better as it has the right inductive bias for this task.

WS 2021/2022 Introduction to Deep Learning – 5 / 18


APPLICATION - BEYOND IMAGE CLASSIFICATION
There are many visual recognition problems that are related to image
classification, such as:
object detection
image captioning
semantic segmentation
visual question answering
visual instruction navigation
scene graph generation

WS 2021/2022 Introduction to Deep Learning – 6 / 18


APPLICATION - IMAGE COLORIZATION
Basic idea (introduced by Zhang et al., 2016):
Train the net on pairs of grayscale and colored images.
Force it to make a prediction on the color-value for each
pixel in the grayscale input image.
Combine the grayscale-input with the color-output to yield a
colorized image.
Very comprehensive material on the method is provided on the
author’s website. Click here

WS 2021/2022 Introduction to Deep Learning – 7 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The CNN learns the mapping from grayscale (L) to color (ab) for each
pixel in the image. The L and ab maps are then concatenated to yield the
colorized image. The authors use the LAB color space for the image
representation.

WS 2021/2022 Introduction to Deep Learning – 8 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The colour space (ab) is quantized in a total of 313 bins. This allows
to treat the color prediction as a classification problem where each pixel is
assigned a probability distribution over the 313 bins and the one with the
highest softmax score is taken as predicted color value. The bin is then
mapped back to the corresponding numeric (a,b) values. The network is
optimized using a multinomial cross-entropy loss over the 313 quantized (a,b)
bins.

WS 2021/2022 Introduction to Deep Learning – 9 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The architecture consists of stacked CNN layers which are upsampled
towards the end of the net. It makes use of dilated convolutions and
upsampling layers which will be explained later. The output is a tensor of
dimension [64, 64, 313] that stores the 313 probabilities for each element of
the final, downsampled 64x64 feature maps.

WS 2021/2022 Introduction to Deep Learning – 10 / 18


APPLICATION - IMAGE COLORIZATION

Figure: This block is then upsampled to a dimension of 224x224 and the


predicted color bins are mapped back to the (a,b) values yielding a depth of 2.
Finally, the L and the ab maps are concatenated to yield a colored image.

WS 2021/2022 Introduction to Deep Learning – 11 / 18


APPLICATION - OBJECT LOCALIZATION
Until now, we used CNNs for single-class classification of images -
which object is on the image?
Now we extend this framework - is there an object in the image
and if yes, where and which?

Figure: Classify and detect the location of the cat.

WS 2021/2022 Introduction to Deep Learning – 12 / 18


APPLICATION - OBJECT LOCALIZATION
Bounding boxes can be defined by the location of the left lower
corner as well as the height and width of the box: [bx , by , bh , bw ].
We now combine three tasks (detection, classification and
localization) in one architecture.
This can be done by adjusting the label output of the net.
Imagine a task with three classes (cat, car, frog).
In standard classification we would have:

label vector (c_cat, c_car, c_frog)ᵀ and softmax output (P(y = cat | X, θ), P(y = car | X, θ), P(y = frog | X, θ))ᵀ

WS 2021/2022 Introduction to Deep Learning – 13 / 18


APPLICATION - OBJECT LOCALIZATION
We include the information, if there is a object as well as the
bounding box parametrization in the label vector.
This gives us the following label vector:

y = (b_x, b_y, b_h, b_w, c_o, c_cat, c_car, c_frog)ᵀ, where

b_x:    x coordinate of the box
b_y:    y coordinate of the box
b_h:    height of the box
b_w:    width of the box
c_o:    presence of an object (binary)
c_cat:  class cat, one-hot
c_car:  class car, one-hot
c_frog: class frog, one-hot

WS 2021/2022 Introduction to Deep Learning – 14 / 18


APPLICATION - OBJECT LOCALIZATION

Naive approach: use a CNN with two heads, one for the class
classification and one for the bounding box regression.
But: What happens, if there are two cats in the image?
Different approaches: "Region-based" CNNs (R-CNN, Fast R-CNN
and Faster R-CNN) and "single-shot" CNNs (SSD and YOLO).

WS 2021/2022 Introduction to Deep Learning – 15 / 18


SEMANTIC SEGMENTATION
The goal of semantic image segmentation is to label each pixel of an
image with a corresponding class of what is being represented.

Figure:

U-Net, Ronneberger et al. MICCAI 2015

WS 2021/2022 Introduction to Deep Learning – 16 / 18


IMAGE CAPTIONING
The goal of image captioning is to convert a given input image into a
natural language description.

Figure: "a group of people riding on top of an elephant": Caption generation with augmented visual attention by Biswas et al., 2020

WS 2021/2022 Introduction to Deep Learning – 17 / 18


VISUAL QUESTION ANSWERING
Visual Question Answering is a research area about building a
computer system to answer questions presented in an image and a
natural language.

Figure: Combining CNN/RNN for VQA (source:


https://towardsdatascience.com/)

WS 2021/2022 Introduction to Deep Learning – 18 / 18


Deep Learning

1D / 2D / 3D Convolutions

Learning goals
1D Convolutions
2D Convolutions
3D Convolutions
1D Convolutions

Deep Learning – 1 / 22
1D CONVOLUTIONS
Data situation: Sequential, 1-dimensional tensor data.
Data consists of tensors with shape [depth, xdim]
Depth 1 (single-channel):
Univariate time series, e.g. development of a single stock
price over time
Functional / curve data
Depth > 1 (mutli-channel):
Multivariate time series, e.g.
Movement data measured with multiple sensors for
human activity recognition
Temperature and humidity in weather forecasting
Text encoded as character-level one-hot-vectors
→ Convolve the data with a 1D-kernel

Deep Learning – 2 / 22
1D CONVOLUTIONS – OPERATION

Figure: Illustration of 1D movement data with depth 1 and filter size 1.

Deep Learning – 3 / 22
1D CONVOLUTIONS – OPERATION

Figure: Illustration of 1D movement data with depth 1 and filter size 2.

Deep Learning – 4 / 22
1D CONVOLUTIONS – SENSOR DATA

Figure: Illustration of 1D movement data with depth 3 measured with an


accelerometer sensor belonging to a human activity recognition task.

Deep Learning – 5 / 22
1D CONVOLUTIONS – SENSOR DATA

Figure: Time series classification with 1D CNNs and global average pooling
(explained later). An input time series is convolved with 3 CNN layers, pooled
and fed into a fully connected layer before the final softmax layer. This is one
of the classic time series classification architectures.

Deep Learning – 6 / 22
1D CONVOLUTIONS – TEXT MINING
1D convolutions also have an interesting application in text mining.
For example, they can be used to classify the sentiment of text
snippets such as yelp reviews.

Figure: Sentiment classification: can we teach the net that this a positive
review?

Deep Learning – 7 / 22
1D CONVOLUTIONS – TEXT MINING

We use a given alphabet to encode the text reviews (here:


“dummy review” ).
Each character is transformed into a one-hot vector. The vector for
character d contains only 0’s at all positions except for the 4th
position.
The maximum length of each review is set to 1014: shorter texts
are padded with spaces (zero-vectors), longer texts are simply cut.

Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The data is represented as 1D signal with depth = size of the


alphabet .
Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The temporal dimension is shown as the y dimension for


illustrative purposes.
Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The 1D-kernel (blue) convolves the input in the temporal


y-dimension yielding a 1D feature vector.
Deep Learning – 8 / 22
ADVANTAGES OF 1D CONVOLUTIONS
For certain applications 1D CNNs are advantageous and thus
preferable to their 2D counterparts:
Computational complexity: Forward propagation and backward
propagation in 1D CNNs require simple array operations.
Training is easier: Recent studies show that 1D CNNs with
relatively shallow architectures are able to learn challenging tasks
involving 1D signals.
Hardware: Usually, training deep 2D CNNs requires special
hardware setup (e.g. Cloud computing). However, any CPU
implementation over a standard computer is feasible and relatively
fast for training compact 1D CNNs.
Application: Due to their low computational requirements, compact
1D CNNs are well-suited for real-time and low-cost applications
especially on mobile or hand-held devices.

Deep Learning – 9 / 22
2D Convolutions

Deep Learning – 10 / 22
2D CONVOLUTIONS
The basic idea behind a 2D convolution is sliding a small window
(called a "kernel/filter") over a larger 2D array, and performing a dot
product between the filter elements and the corresponding input array
elements at every position.

Figure: Here’s a diagram demonstrating the application of a 2×2 convolution filter to a 5×5 array, in 16 different positions.

Deep Learning – 11 / 22
2D CONVOLUTIONS – EXAMPLE

In Deep Learning, convolution is the element-wise multiplication


and addition.
For an image with 1 channel, the convolution is demonstrated in
the figure below. Here the filter is a 2×2 matrix with elements [[0, 1], [2, 2]].

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

The filter is sliding through the input.


We move/convolve filter on input neurons to create a feature map.

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

Notice that stride is 1 and padding is 0 in this example.

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

Each sliding position ends up with one number. The final output is
then a 2 × 2 matrix.
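A small NumPy sketch of this sliding-window computation; the 3 × 3 input values are made up, the 2 × 2 kernel is the [[0, 1], [2, 2]] filter from the example:

import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep learning calls 'convolution'):
    stride 1, no padding."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiplication and addition at each position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])          # made-up 3x3 input
kernel = np.array([[0, 1],
                   [2, 2]])            # filter from the example
print(conv2d(image, kernel))           # [[20. 25.] [35. 40.]], a 2x2 feature map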

Deep Learning – 12 / 22
3D Convolutions

Deep Learning – 13 / 22
3D CONVOLUTIONS
Data situation: 3-dimensional tensor data.
Data consists of tensors with shape [depth, xdim, ydim, zdim].
Dimensions can be both temporal (e.g. video frames) or spatial
(e.g. MRI)
Examples:
Human activity recognition in video data
Disease classification or tumor segmentation on MRI scans
Solution: Move a 3D-kernel in x, y and z direction to capture all
important information.

Deep Learning – 14 / 22
3D CONVOLUTIONS – DATA

Figure: Illustration of depth 1 volumetric data: MRI scan. Each slice of the stack has depth 1, as the frames are black-and-white (grayscale).

Deep Learning – 15 / 22
3D CONVOLUTIONS – DATA

Figure: Illustration of volumetric data with depth > 1: video snippet of an action detection task. The video consists of several slices, stacked in temporal order. Frames have depth 3, as they are RGB.

Deep Learning – 16 / 22
3D CONVOLUTIONS

Note: 3D convolutions yield a 3D output.

Deep Learning – 17 / 22
3D CONVOLUTIONS

Figure: Basic 3D-CNN architecture.

Basic architecture of the CNN stays the same.


3D convolutions output 3D feature maps which are element-wise
activated and then (eventually) pooled in 3 dimensions.

Deep Learning – 18 / 22
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 19 / 22
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 20 / 22
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 21 / 22
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 22 / 22
Deep Learning

Important Types of Convolutions

Learning goals
Dilated Convolutions
Transposed Convolutions
Dilated Convolutions

Deep Learning – 1 / 26
DILATED CONVOLUTIONS
Idea : artificially increase the receptive field of the net without
using more filter weights.
The receptive field of a single neuron comprises all inputs that
have an impact on this neuron.
Neurons in the first layers capture less information of the input,
while neurons in the last layers have huge receptive fields and can
capture a lot more global information from the input.
The size of the receptive fields depends on the filter size.

Figure: Receptive field of each convolution layer with a 3 × 3 kernel. The orange area marks the receptive field of one pixel in layer 2, the yellow area marks the receptive field of one pixel in layer 3.

Deep Learning – 2 / 26
DILATED CONVOLUTIONS
Intuitively, neurons in the first layers capture less information of the
input (layer), while neurons in the last layers have huge receptive
fields and can capture a lot more global information from the input
(layer).
The size of the receptive fields depends on the filter size.

Figure: A convolutional neural network with 3 layers of 3 × 3 kernels. The orange area marks the receptive field of one neuron in layer 2 w.r.t. the input layer (size 9), the yellow area marks the receptive field of one pixel in layer 3.

Deep Learning – 3 / 26
DILATED CONVOLUTIONS
By increasing the filter size, the size of the receptive fields
increases as well and more contextual information can be
captured.
However, increasing the filter size increases the number of
parameters, which leads to increased runtime.
Artificially increase the receptive field of the net without using more
filter weights by adding a new dilation parameter to the kernel that
skips pixels during convolution.
Benefits:
Capture more contextual information.
Enable the processing of inputs in higher dimensions to
detect fine details.
Improved run-time performance due to fewer parameters (see the sketch below).
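A minimal PyTorch sketch of the dilation parameter (input size and channel counts are arbitrary): a dilated 3 × 3 kernel covers a 5 × 5 area with the same 9 weights.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                            # dummy single-channel image

standard = nn.Conv2d(1, 1, kernel_size=3, dilation=1)    # covers a 3x3 area
dilated  = nn.Conv2d(1, 1, kernel_size=3, dilation=2)    # covers a 5x5 area
                                                          # with the same 9 weights
print(sum(p.numel() for p in standard.parameters()))     # 10 (9 weights + 1 bias)
print(sum(p.numel() for p in dilated.parameters()))      # 10 as well
print(standard(x).shape)    # torch.Size([1, 1, 30, 30])
print(dilated(x).shape)     # torch.Size([1, 1, 28, 28]); no padding is used here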

Deep Learning – 4 / 26
DILATED CONVOLUTIONS
Useful in applications where the global context is of great
importance for the model decision.
This component finds application in:
Generation of audio signals and music within the famous WaveNet model developed by DeepMind.
Time series classification and forecasting.
Image segmentation.

Figure: Dilated convolution on 2D data. A dilated kernel is a regular convolutional kernel interleaved with zeros.

Deep Learning – 5 / 26
DILATED CONVOLUTIONS

Figure: Simple 1D convolutional network with convolutional kernel of size 2, stride 2 and fixed weights {0.5, 1.0}. The kernel is not dilated (dilation factor 1). One neuron in layer 2 has a receptive field of size 4 w.r.t. the input layer.

Deep Learning – 6 / 26
DILATED CONVOLUTIONS

Figure: Simple 1D convolutional network with convolutional kernel of size 2, stride 2 and fixed weights {0.5, 1.0}. The kernel is dilated with dilation factor 2. One neuron in layer 2 has a receptive field of size 7 w.r.t. the input layer.

Deep Learning – 7 / 26
DILATED CONVOLUTIONS

Figure: Application of (a variant of) dilated convolutions on time series for classification or seq2seq prediction (e.g. machine translation). Given an input sequence x_0, x_1, ..., x_T, the model generates an output sequence ŷ_0, ŷ_1, ..., ŷ_T. Dilation factors d = 1, 2, 4 are shown above, each with a kernel size k = 3. The dilations are used to drastically increase the context information for each output neuron with relatively few layers.

Deep Learning – 8 / 26
DILATED CONVOLUTIONS

Deep Learning – 9 / 26
Transposed Convolutions

Deep Learning – 10 / 26
TRANSPOSED CONVOLUTIONS
Problem setting:
For many applications and in many network architectures, we
often want to do transformations going in the opposite
direction of a normal convolution, i.e. we would like to perform
up-sampling.
Examples include generating high-resolution images and mapping a low-dimensional feature map to a high-dimensional space, e.g. in auto-encoders or semantic segmentation.
Instead of decreasing dimensionality as with regular convolutions,
transposed convolutions are used to re-increase dimensionality
back to the initial dimensionality.
Note: Do not confuse this with deconvolutions (which are
mathematically defined as the inverse of a convolution).

Deep Learning – 11 / 26
TRANSPOSED CONVOLUTIONS
Example 1:
Input: yellow feature map with dim 4 × 4.
Output: blue feature map with dim 2 × 2.

Figure: A regular convolution with kernel size k = 3, padding p = 0 and stride s = 1.

Here, the feature map shrinks from 4 × 4 to 2 × 2.

Deep Learning – 12 / 26
TRANSPOSED CONVOLUTIONS
One way to upsample is to use a regular convolution with various
padding strategies.

Figure: A transposed convolution can be seen as a regular convolution. The convolution (above) with k′ = 3, s′ = 1, p′ = 2 re-increases dimensionality from 2 × 2 to 4 × 4.

Deep Learning – 13 / 26
TRANSPOSED CONVOLUTIONS
For a convolution with kernel size k, stride s = 1 and padding p,
the associated transposed convolution has parameters k′ = k, s′ = s and p′ = k − p − 1 (here p = 0, hence p′ = k − 1).

Deep Learning – 14 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Convolution as a matrix multiplication :

credit:Stanford University

Figure: A "regular" 1D convolution. stride = 1 , padding = 1. The vector a is


the 1D input feature map.

Deep Learning – 15 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Transposed Convolution as a matrix multiplication :

credit:Stanford University

Figure: "Transposed" convolution upsamples a vector of length 4 to a vector of


length 6. Stride is 1. Note the change in padding.

Important : Even though the "structure" of the matrix here is the transpose of
the original matrix, the non-zero elements are, in general, different from the
correponding elements in the original matrix. These (non-zero)
elements/weights are tuned by backpropagation.

Deep Learning – 16 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Convolution as matrix multiplication:

Figure: A regular 1D convolution with stride = 1 and padding = 0. The vector z is the input feature map. The matrix K represents the convolution operation.

A regular convolution decreases the dimensionality of the feature map from 6 to 4.

Deep Learning – 17 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:

Figure: A transposed convolution can be used to upsample the feature vector of length 4 back to a feature vector of length 6.

Note:
Even though the transpose of the original matrix is shown in this
example, the actual values of the weights are different from the
original matrix (and optimized by backpropagation).
The goal of the transposed convolution here is simply to get back
the original dimensionality. It is not necessarily to get back the
original feature map itself.
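A small NumPy sketch of this matrix view (the kernel values and input are made up): the convolution matrix K maps a length-6 vector to length 4, and multiplying by K⊤ maps it back to length 6.

import numpy as np

k = np.array([1., 2., 3.])          # illustrative 1D kernel of size 3
n_in, n_out = 6, 4                  # conv maps length 6 -> 4 (stride 1, no padding)

# Build the convolution matrix K (4 x 6): each row holds the kernel,
# shifted by one position.
K = np.zeros((n_out, n_in))
for i in range(n_out):
    K[i, i:i + len(k)] = k

z = np.arange(1., 7.)               # input feature map of length 6
down = K @ z                        # regular convolution: length 4
up = K.T @ down                     # "transposed" convolution: back to length 6
print(down.shape, up.shape)         # (4,) (6,)

# In a network, the non-zero entries of K.T would be free weights learned by
# backpropagation, not literally the transposed values of K.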

Deep Learning – 18 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:

Figure: A transposed convolution can be used to upsample the feature vector of length 4 back to a feature vector of length 6.
Note:
The elements in the downsampled vector only affect those
elements in the upsampled vector that they were originally
"derived" from. For example, z7 was computed using z1 , z2 and z3
and it is only used to compute z̃1 , z̃2 and z̃3 .
In general, transposing the original matrix does not result in a
convolution. But a transposed convolution can always be
implemented as a regular convolution by using various padding
strategies (this would not be very efficient, however).
Deep Learning – 18 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.

credit: Stanford University

Figure: Regular 3 × 3 convolution, stride 2, padding 1.

Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.

credit: Stanford University

Figure: Transposed 3 × 3 convolution, stride 2, padding 1. Note: stride now refers to the "stride" in the output.

Here, the filter is scaled by the input.

Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK

Figure: Artifacts produced by transposed convolutions.

Transposed convolutions lead to checkerboard-style artifacts in the resulting images.

Deep Learning – 20 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Explanation: transposed convolution yields an overlap in some feature
map values.
This leads to higher magnitude for some feature map elements than for
others, resulting in the checkerboard pattern.
One solution is to ensure that the kernel size is divisible by the stride.

Figure: 1D example. In both images, top row = input and bottom row = output. Top:
Here, kernel weights overlap unevenly which results in a checkerboard pattern. Bottom:
There is no checkerboard pattern as the kernel size is divisible by the stride.

Deep Learning – 21 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Solutions:
Increase dimensionality via upsampling (bilinear, nearest
neighbor) and then convolve this output with regular
convolution.
Make sure that the kernel size k is divisible by the stride s.

Figure: Nearest neighbor upsampling and a subsequent same-padding convolution to avoid checkerboard patterns.
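A minimal PyTorch sketch of this upsample-then-convolve strategy (channel count and feature-map size are arbitrary):

import torch
import torch.nn as nn

# Upsample first (nearest neighbor), then apply a regular "same" convolution.
up_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),   # keeps height and width
)

x = torch.randn(1, 16, 8, 8)
print(up_block(x).shape)    # torch.Size([1, 16, 16, 16])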

Deep Learning – 22 / 26
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 23 / 26
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 24 / 26
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 25 / 26
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 26 / 26
Deep Learning

Separable Convolutions and Flattening

Learning goals
Separable Convolutions
Flattening
Separable Convolutions

Deep Learning – 1 / 17
SEPARABLE CONVOLUTIONS
Separable Convolutions are used in some neural net architectures,
such as the MobileNet.
Motivation: make convolution computationally more efficient.
One can perform:
spatially separable convolution
depthwise separable convolution.
The spatially separable convolution operates on the 2D spatial
dimensions of images, i.e. height and width. Conceptually, spatially
separable convolution decomposes a convolution into two separate
operations.
Consider the Sobel kernel from the previous lecture:

G_x = \begin{pmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{pmatrix}

Deep Learning – 2 / 17
SEPARABLE CONVOLUTIONS
This 3×3 kernel can be replaced by the outer product of a 3×1 and a 1×3 kernel:

G_x = \begin{pmatrix} +1 \\ +2 \\ +1 \end{pmatrix} \begin{pmatrix} +1 & 0 & -1 \end{pmatrix}

Convolving with both filters sequentially has the same effect, reduces the number of parameters to be stored and thus improves speed (see the sketch below):
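A small NumPy check of this decomposition:

import numpy as np

col = np.array([[1.], [2.], [1.]])          # 3x1 kernel
row = np.array([[1., 0., -1.]])             # 1x3 kernel

G_x = col @ row                             # outer product
print(G_x)
# [[ 1.  0. -1.]
#  [ 2.  0. -2.]
#  [ 1.  0. -1.]]

# Convolving an image with `col` and then with `row` gives the same result as a
# single convolution with G_x, but needs 3 + 3 = 6 weights instead of 9.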

Deep Learning – 3 / 17
SEPARABLE CONVOLUTIONS

Figure: In convolution, the 3x3 kernel directly convolves with the image. In
spatially separable convolution, the 3x1 kernel first convolves with the image.
Then the 1x3 kernel is applied. This would require 6 instead of 9 parameters
while doing the same operations.

Deep Learning – 4 / 17
SPATIALLY SEPARABLE CONVOLUTION
Example 1: A convolution on a 5 × 5 image with a 3 × 3 kernel
(stride=1, padding=0) requires scanning the kernel at 3 positions
horizontally and 3 vertically. That is 9 positions in total, indicated as the
dots in the image below. At each position, 9 element-wise
multiplications are applied. Overall, that is 9 x 9 = 81 multiplications.

Figure: Standard convolution with 1 channel.

Deep Learning – 5 / 17
SPATIALLY SEPARABLE CONVOLUTION

Figure: Spatially separable convolution with 1 channel. Overall, the spatially separable convolution takes 45 + 27 = 72 multiplications. (Image source: Bai (2019))

Note: Despite these advantages, spatially separable convolutions are rarely used in deep learning. The main reason is that not every kernel can be decomposed into two smaller ones. Replacing all standard convolutions by spatially separable ones would also restrict the set of kernels that can be learned during training, typically leading to worse results.

Deep Learning – 6 / 17
DEPTHWISE SEPARABLE CONVOLUTION
Depthwise separable convolutions are much more commonly used in deep learning (e.g. in MobileNet and Xception).
This convolution separates the convolutional process into two stages: a depthwise and a pointwise convolution.

Figure: Comparison between a standard convolution and a depthwise separable convolution.

Deep Learning – 7 / 17
DEPTHWISE SEPARABLE CONVOLUTION

Figure: Comparison of the number of multiplications in a depthwise separable convolution and a standard convolution.

Fewer computations therefore lead to a faster network.

Deep Learning – 8 / 17
DEPTHWISE SEPARABLE CONVOLUTION

Figure: Comparison of the number of multiplications in a depthwise separable convolution and a standard convolution.

Deep Learning – 9 / 17
DEPTHWISE CONVOLUTION
As the name suggests, we apply the kernels along the depth of the input volume (i.e. on the input channels). The steps followed in this convolution are:
Take as many kernels as there are input channels, each kernel having depth 1. For example, for a 3 × 3 kernel and a 6 × 6 input with 16 channels, there are 16 kernels of size 3 × 3 × 1.
Every channel thus has one kernel associated with it. Each kernel is convolved over its associated channel separately, resulting in 16 feature maps.
Stack all these feature maps to get the output volume with 4 × 4 output size and 16 channels.

Deep Learning – 10 / 17
POINTWISE CONVOLUTION
As the name suggests, this type of convolution is applied to every single spatial position separately (remember 1 × 1 convolutions?). So how does this work?
Take a 1 × 1 convolution with as many filters as the number of output channels you want.
Apply this 1 × 1 convolution to the output of the depthwise convolution (see the sketch below).
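A minimal PyTorch sketch comparing the parameter counts (the channel counts are made up; batch norm and activations are omitted):

import torch.nn as nn

in_ch, out_ch = 16, 32

# Standard convolution: 32 * 16 * 3 * 3 = 4608 weights (+ 32 biases).
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: one 3x3 kernel per input channel (groups=in_ch),
# followed by a 1x1 (pointwise) convolution that mixes channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(n_params(standard))               # 4640
print(n_params(depthwise, pointwise))   # 160 + 544 = 704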

Deep Learning – 11 / 17
Flattening

Deep Learning – 12 / 17
FLATTENING
Flattening converts the data into a 1-dimensional array for inputting it to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification model, a fully-connected layer.
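A minimal PyTorch sketch (the shapes are illustrative):

import torch
import torch.nn as nn

feature_maps = torch.randn(8, 64, 7, 7)        # (batch, channels, height, width)
flat = feature_maps.flatten(start_dim=1)       # -> (8, 64*7*7) = (8, 3136)
classifier = nn.Linear(64 * 7 * 7, 10)         # final fully-connected layer
print(classifier(flat).shape)                  # torch.Size([8, 10])
# Inside an nn.Sequential model, nn.Flatten() performs the same operation.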

Deep Learning – 13 / 17
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 14 / 17
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 15 / 17
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 16 / 17
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 17 / 17
Deep Learning

Modern Architectures - I
Learning goals
LeNet
AlexNet
VGG
Network in Network
LeNet

Deep Learning – 1 / 18
LENET ARCHITECTURE
Pioneering work on CNNs by Yann LeCun in 1998.
Applied on the MNIST dataset for automated handwritten digit
recognition.
Consists of convolutional, "subsampling" and dense layers.
Complexity and depth of the net was mainly restricted by limited
computational power back in the days.

Figure: LeNet architecture: two conv layers with subsampling, followed by dense layers and a ’Gaussian connections’ layer.

Deep Learning – 2 / 18
LENET ARCHITECTURE
A neuron in a subsampling layer looks at a 2 × 2 region of a feature map, sums the four values, multiplies the sum by a trainable coefficient, adds a trainable bias and then applies a sigmoid activation.
A stride of 2 ensures that the size of the feature map reduces by
about a half.
The ’Gaussian connections’ layer has a neuron for each possible
class.
The output of each neuron in this layer is the (squared) Euclidean
distance between the activations from the previous layer and the
weights of the neuron.

Deep Learning – 3 / 18
AlexNet

Deep Learning – 4 / 18
ALEXNET
AlexNet, which employed an 8-layer CNN, won the ImageNet
Large Scale Visual Recognition (LSVR) Challenge 2012 by a
phenomenally large margin.
The network was trained in parallel on two small GPUs, using two streams of convolutions which are partly interconnected.
The architectures of AlexNet and LeNet are very similar, but there
are also significant differences:
First, AlexNet is deeper than the comparatively small LeNet5.
AlexNet consists of eight layers: five convolutional layers, two
fully-connected hidden layers, and one fully-connected output
layer.
Second, AlexNet used the ReLU instead of the sigmoid as its
activation function.

Deep Learning – 5 / 18
ALEXNET

Figure: From LeNet (left) to AlexNet (right).

Deep Learning – 6 / 18
VGG

Deep Learning – 7 / 18
VGG BLOCKS
The block is composed of convolutions with 3 × 3 kernels with padding of 1 (keeping height and width) and 2 × 2 max pooling with stride of 2 (halving the resolution after each block).
The use of blocks leads to very compact representations of the
network definition.
It allows for efficient design of complex networks.

credit : D2DL

Figure: VGG block.
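A minimal PyTorch sketch of such a block, following the description above:

import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """A VGG block: `num_convs` 3x3 convolutions (padding 1, so height and
    width are kept), each followed by ReLU, then 2x2 max pooling with stride 2
    that halves the resolution."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(num_convs=2, in_channels=3, out_channels=64)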

Deep Learning – 8 / 18
VGG NETWORK
Architecture introduced by Simonyan and Zisserman, 2014 as
“Very Deep Convolutional Network”.
A deeper variant of the AlexNet.
The basic idea is to use small filters and deeper networks.
Mainly uses many conv layers with a small kernel size of 3 × 3.
A stack of three 3 × 3 conv layers (stride 1) has the same effective receptive field as one 7 × 7 conv layer.
Performed very well in the ImageNet Challenge in 2014.
Exists in a smaller version (VGG16) with a total of 16 layers (13 conv layers and 3 fc layers) and a larger version (VGG19) with 19 layers (16 conv layers and 3 fc layers), both built from 5 VGG blocks.

Deep Learning – 9 / 18
VGG NETWORK

credit : D2DL

Figure: From AlexNet to VGG that is designed from building blocks.

Deep Learning – 10 / 18
Network in Network (NiN)

Deep Learning – 11 / 18
NIN BLOCKS
The idea behind NiN is to apply a fully-connected layer at each
pixel location. If we tie the weights across each spatial location, we
could think of this as a 1 × 1 convolutional layer.
The NiN block consists of one convolutional layer followed by two
1 × 1 convolutional layers that act as per-pixel fully-connected
layers with ReLU activations.
The convolution window shape of the first layer is typically set by
the user. The subsequent window shapes are fixed to 1 × 1.

credit : D2DL

Figure: NiN block.
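A minimal PyTorch sketch of a NiN block as described above:

import torch.nn as nn

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    """One convolution with a user-chosen window, followed by two 1x1
    convolutions acting as per-pixel fully-connected layers with ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
    )

block = nin_block(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=0)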

Deep Learning – 12 / 18
GLOBAL AVERAGE POOLING
Goal: tackle overfitting in the final fully connected layer.
The elements of the final feature maps are connected to the
output layer via a dense layer. This could require a huge
number of weights increasing the danger of overfitting.
Example: 1024 feature maps of dim 7 × 7 connected to 10 output neurons lead to 1024 · 7² · 10 ≈ 500,000 weights for the final dense layer.
The larger the feature map, the more detrimental.
Classic pooling not ideal as it removes spatial information and
is mainly used for dimension and parameter reduction.

Deep Learning – 13 / 18
GLOBAL AVERAGE POOLING
Solution:
Average each final feature map into a single element of a global average pooling (GAP) vector.
Example: the 1024 feature maps are now reduced to a GAP vector of length 1024, yielding a final dense layer with only 1024 · 10 weights (see the sketch below).
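A minimal PyTorch sketch contrasting the two heads for the 1024 × 7 × 7 example (the batch size is arbitrary):

import torch
import torch.nn as nn

feature_maps = torch.randn(8, 1024, 7, 7)       # (batch, channels, height, width)

# Dense head on the flattened maps: 1024 * 7 * 7 * 10 weights (+ 10 biases).
dense_head = nn.Linear(1024 * 7 * 7, 10)

# GAP head: average each 7x7 map to one value, then only 1024 * 10 weights.
gap = feature_maps.mean(dim=(2, 3))             # (8, 1024)
gap_head = nn.Linear(1024, 10)

print(sum(p.numel() for p in dense_head.parameters()))   # 501770
print(sum(p.numel() for p in gap_head.parameters()))     # 10250
print(gap_head(gap).shape)                                # torch.Size([8, 10])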

Figure: An Example of Fully Connected Layer VS Global Average Pooling


Layer

Deep Learning – 14 / 18
GLOBAL AVERAGE POOLING
GAP condenses each feature map into a single value, preserving its overall activation while drastically decreasing the dimension.
Mitigates the possibly destructive effect of pooling.
Each element of the GAP output represents the activation of a
certain feature on the input data.
Acts as an additional regularizer on the final fully connected layer.

Deep Learning – 15 / 18
NETWORK IN NETWORK (NIN)

NiN uses blocks consisting of a convolutional layer and multiple


1 × 1 convolutional layers. This can be used within the
convolutional stack to allow for more per-pixel nonlinearity.
NiN removes the fully-connected layers and replaces them with
global average pooling (i.e., summing over all locations) after
reducing the number of channels to the desired number of outputs
(e.g., 10 for Fashion-MNIST).
Removing the fully-connected layers reduces overfitting. NiN has
dramatically fewer parameters.
The NiN design influenced many subsequent CNN designs.

Deep Learning – 16 / 18
NETWORK IN NETWORK (NIN)

credit : D2DL

Figure: Comparing architectures of VGG and NiN, and their blocks.

Deep Learning – 17 / 18
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf

Deep Learning – 18 / 18
Deep Learning

Modern Architectures - II

Learning goals
GoogLeNet
ResNet
DenseNet
U-Net
GoogLeNet

Deep Learning – 1 / 24
INCEPTION MODULES
The Inception block is equivalent to a subnetwork with four paths.
It extracts information in parallel through convolutional layers of
different window shapes and max-pooling layers.
1 × 1 convolutions reduce channel dimensionality on a per-pixel
level. Max-pooling reduces the resolution.

Figure: Inception Block.

Deep Learning – 2 / 24
GOOGLENET ARCHITECTURE
GoogLeNet connects multiple well-designed Inception blocks with
other layers in series.
The ratio of the number of channels assigned in the Inception
block is obtained through a large number of experiments on the
ImageNet dataset.
GoogLeNet, as well as its succeeding versions, was one of the
most efficient models on ImageNet, providing similar test accuracy
with lower computational complexity.

Deep Learning – 3 / 24
GOOGLENET ARCHITECTURE

credit : D2DL

Figure: The GoogLeNet architecture.

Deep Learning – 4 / 24
Residual Networks (ResNet)

Deep Learning – 5 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Problem setting: theoretically, we could build infinitely deep
architectures as the net should learn to pick the beneficial layers and
skip those that do not improve the performance automatically.

credit : D2DL

Figure: For non-nested function classes, a larger function class does not
guarantee to get closer to the “truth” function (F∗). This does not happen in
nested function classes.

Deep Learning – 6 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
But: this skipping would imply learning an identity mapping
x = F(x). It is very hard for a neural net to learn such a 1:1
mapping through the many non-linear activations in the
architecture.
Solution: offer the model explicitly the opportunity to skip certain
layers if they are not useful.
Introduced in He et al., 2015, and motivated by the observation that stacking ever more layers increases the test error as well as the train error (≠ overfitting).

Deep Learning – 7 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)

credit : D2DL

Figure: A regular block (left) and a residual block (right).

Deep Learning – 8 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)

credit : D2DL

Figure: ResNet block with and without 1 × 1 convolution. The information flows through two layers and the identity function. Both streams of information are then element-wise summed and jointly activated.

Deep Learning – 9 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Let H(x) be the optimal underlying mapping that should be
learned by (parts of) the net.
x is the input in layer l (can be raw data input or the output of a
previous layer).
H(x) is the output from layer l.
Instead of fitting H(x), the net ought to learn the residual mapping F(x) := H(x) − x, while x is added via the identity mapping.
Thus, H(x) = F(x) + x, as formulated on the previous slide.
The model should only learn the residual mapping F(x)
Thus, the procedure is also referred to as Residual Learning.

Deep Learning – 10 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
The element-wise addition of the learned residuals F(x) and the
identity-mapped data x requires both to have the same
dimensions.
To allow for downsampling within F(x) (via pooling or valid-padded
convolutions), the authors introduce a linear projection layer Ws .
Ws ensures that x is brought to the same dimensionality as F(x)
such that:
y = F(x) + Ws x,
y is the output of the skip module and Ws represents the weight
matrix of the linear projection (# rows of Ws = dimensionality of
F(x)).
This idea applies to fully connected layers as well as to
convolutional layers.
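A minimal PyTorch sketch of a residual block with an optional 1 × 1 projection (batch normalization, which the original ResNet uses, is omitted for brevity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = F(x) + W_s x. The 1x1 convolution plays the role of the linear
    projection W_s and is only needed when F(x) changes the shape of x."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))          # F(x)
        identity = x if self.proj is None else self.proj(x)   # x or W_s x
        return F.relu(residual + identity)                    # element-wise sum

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 128, 16, 16])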

Deep Learning – 11 / 24
RESNET ARCHITECTURE
The residual mapping can learn the identity function more easily,
such as pushing parameters in the weight layer to zero.
We can train an effective deep neural network by having residual
blocks.
Inputs can forward propagate faster through the residual
connections across layers.
ResNet had a major influence on the design of subsequent deep
neural networks, both for convolutional and sequential nature.

Deep Learning – 12 / 24
RESNET ARCHITECTURE

credit : D2DL

Figure: The ResNet-18 architecture.

Deep Learning – 13 / 24
Densely Connected Networks (DenseNet)

Deep Learning – 14 / 24
FROM RESNET TO DENSENET
ResNet significantly changed the view of how to parametrize the
functions in deep networks.
DenseNet (dense convolutional network) is to some extent the
logical extension of this [Huang et al., 2017].
Dense blocks where each layer is connected to every other layer in
feedforward fashion.
Alleviates vanishing gradient, strengthens feature propagation,
encourages feature reuse.
To understand how to arrive at it, let us take a small detour to
mathematics:
Recall the Taylor expansion for functions. For the point x = 0
it can be written as:
f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \ldots

Deep Learning – 15 / 24
FROM RESNET TO DENSENET
The key point is that it decomposes a function into
increasingly higher order terms. In a similar vein, ResNet
decomposes functions into : f (x) = x + g (x).
That is, ResNet decomposes f into a simple linear term and a
more complex nonlinear one. What if we want to capture (not
necessarily add) information beyond two terms? One solution
was DenseNet [Huang et al., 2017].

credit : D2DL

Figure: DenseNet block.

Deep Learning – 16 / 24
FROM RESNET TO DENSENET
As shown in the previous figure, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated (denoted by [, ]) rather than added. As a result, we perform a mapping from x to its values after applying an increasingly complex sequence of functions:
x → [x, f1 (x), f2 ([x, f1 (x)]), f3 ([x, f1 (x), f2 ([x, f1 (x)])]), . . .] .
In the end, all these functions are combined in MLP to reduce the number of
features again. In terms of implementation this is quite simple: rather than
adding terms, we concatenate them.
The name DenseNet arises from the fact that the dependency graph between
variables becomes quite dense. The last layer of such a chain is densely
connected to all previous layers.
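A minimal PyTorch sketch of a dense block (channel counts and growth rate are made up; the original DenseNet additionally uses batch norm and 1 × 1 bottlenecks):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of the block input and all
    previous layer outputs along the channel axis."""
    def __init__(self, num_layers, in_ch, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1),
                nn.ReLU())
            for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # concatenate, do not add
        return x

block = DenseBlock(num_layers=3, in_ch=16, growth_rate=8)
print(block(torch.randn(1, 16, 28, 28)).shape)    # torch.Size([1, 40, 28, 28])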

credit : D2DL

Figure: The DenseNet architecture.

Deep Learning – 17 / 24
U-Net

Deep Learning – 18 / 24
U-NET
U-Net is a fully convolutional net that makes use of upsampling (via
transposed convolutions, for example) as well as skip connections.
Input images are getting convolved and down-sampled in the first
half of the architecture.
Then, they are getting upsampled and convolved again in the
second half to get back to the input dimension.
Skip connections throughout the net combine feature maps from
earlier layers with those from later layers by concatenating both
sets of maps along the depth/channel axis.
Only convolutional and no dense layers are used.

Deep Learning – 19 / 24
U-NET

Figure: Illustration of the architecture. Blue arrows are convolutions, red


arrows max-pooling operations, green arrows upsampling steps and the brown
arrows merge layers with skip connections. The height and width of the feature
blocks are shown on the vertical and the depth on the horizontal. D are
dropout layers.

Deep Learning – 20 / 24
U-NET
Example problem setting: train a neural net to pixelwise segment
roads in satellite imagery.
Answer the question: Where is the road map?

Figure: Model prediction on a test satellite image. Yellow are correctly


identified pixels, blue false negatives and red false positives.

Deep Learning – 21 / 24
U-NET
The net takes an RGB image [512, 512, 3] and outputs a binary
(road / no road) probability mask [512, 512, 1] for each pixel.
The model is trained via a binary cross entropy loss which was
combined over each pixel.

Figure: Scheme for the input/ output of the net architecture.

Deep Learning – 22 / 24
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf

Deep Learning – 23 / 24
REFERENCES
Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015)
U-Net: Convolutional Networks for Biomedical Image Segmentation
http: // arxiv. org/ abs/ 1505. 04597

Deep Learning – 24 / 24
Deep Learning

Recurrent Neural Networks - Introduction

Learning goals
Why do we need them?
How do they work?
Computational Graph of Recurrent Networks
Motivation

Deep Learning – 1 / 21
MOTIVATION FOR RECURRENT NETWORKS
The two types of neural network architectures that we’ve seen so
far are fully-connected networks and CNNs.
Their input layers have a fixed size and (typically) only handle
fixed-length inputs.
The primary reason: if we vary the size of the input layer, we would
also have to vary the number of learnable weights in the network.
This in particular relates to sequence data such as time-series,
audio and text.
Recurrent Neural Networks (RNNs) is a class of architectures
that allows varying input lengths and properly accounts for the
ordering in sequence data.

Deep Learning – 2 / 21
RNNS - INTRODUCTION
Suppose we have some text data and our task is to analyse the
sentiment in the text.
For example, given an input sentence, such as "This is good news.", the
network has to classify it as either ’positive’ or ’negative’.
We would like to train a simple neural network (such as the one below) to
perform the task.

Figure: Two equivalent visualizations of a dense net with a single hidden layer, where
the left is more abstract showing the network on a layer point-of-view.
Deep Learning – 3 / 21
RNNS - INTRODUCTION
Because sentences can be of varying lengths, we need to modify
the dense net architecture to handle such a scenario.
One approach is to draw inspiration from the way a human reads a
sentence; that is, one word at a time.
An important cognitive mechanism that makes this possible is
"short-term memory".
As we read a sentence from beginning to end, we retain some
information about the words that we have already read and use
this information to understand the meaning of the entire sentence.
Therefore, in order to feed the words in a sentence sequentially to
a neural network, we need to give it the ability to retain some
information about past inputs.

Deep Learning – 4 / 21
RNNS - INTRODUCTION
When words in a sentence are fed to the network one at a time,
the inputs are no longer independent. It is much more likely that
the word "good" is followed by "morning" rather than "plastic".
Hence, we also need to model this (long-term) dependency.
Each word must still be encoded as a fixed-length vector because
the size of the input layer will remain fixed.
Here, for the sake of the visualization, each word is represented as
a ’one-hot coded’ vector of length 5. (<eos> = ’end of sequence’)

While this is one option to represent words in a network, the standard approach is word embeddings (more on this later).

Deep Learning – 5 / 21
RNNS - INTRODUCTION
Our goal is to feed the words to the network sequentially in
discrete time-steps.
A regular dense neural network with a single hidden layer only has
two sets of weights: ’input-to-hidden’ weights W and ’hidden-to-
output’ weights U.

Deep Learning – 6 / 21
RNNS - INTRODUCTION
In order to enable the network to retain information about past inputs, we
introduce an additional set of weights V, from the hidden neurons at
time-step t to the hidden neurons at time-step t + 1.
Having this additional set of weights makes the activations of the hidden
layer depend on both the current input and the activations for the
previous input.

Figure: Input-to-hidden weights W and hidden-to-hidden weights V. The


hidden-to-output weights U are not shown for better readability.

Deep Learning – 7 / 21
RNNS - INTRODUCTION
With this additional set of hidden-to-hidden weights V, the network
is now a Recurrent Neural Network (RNN).
In a regular feed-forward network, the activations of the hidden
layer are only computed using the input-hidden weights W (and
bias b).
z = σ(W⊤ x + b)
In an RNN, the activations of the hidden layer (at time-step t) are
computed using both the input-to-hidden weights W and the
hidden-to-hidden weights V.

z[t ] = σ(V⊤ z[t−1] + W⊤ x[t ] + b)

The vector z[t ] represents the short-term memory of the RNN


because it is a function of the current input x[t ] and the activations
z[t −1] of the previous time-step.
Therefore, by recurrence, it contains a "summary" of all previous
inputs.
Deep Learning – 8 / 21
Examples

Deep Learning – 9 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 0, we feed the word "This" to the network and obtain z[0] .
z[0] = σ(W⊤ x[0] + b)

Because this is the very first input, there is no past state (or,
equivalently, the state is initialized to 0).

Deep Learning – 10 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 1, we feed the second word to the network to obtain z[1] .
z[1] = σ(V⊤ z[0] + W⊤ x[1] + b)

Deep Learning – 11 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 2, we feed the next word in the sentence.
z[2] = σ(V⊤ z[1] + W⊤ x[2] + b)

Deep Learning – 12 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 3, we feed the next word ("news") in the sentence.
z[3] = σ(V⊤ z[2] + W⊤ x[3] + b)

Deep Learning – 13 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
Once the entire input sequence has been processed, the
prediction of the network can be generated by feeding the
activations of the final time-step to the output neuron(s).
f = σ(U⊤ z[4] + c ), where c is the bias of the output neuron.
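A minimal NumPy sketch of this sequence-to-one forward pass (weights are random; sizes follow the toy example: 5-dimensional one-hot words, 3 hidden units, 1 output):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))    # input-to-hidden weights
V = rng.normal(size=(3, 3))    # hidden-to-hidden weights
U = rng.normal(size=(3, 1))    # hidden-to-output weights
b = np.zeros(3)
c = np.zeros(1)

def predict(sequence):
    """Sequence-to-one RNN forward pass: one word (one-hot vector) per step."""
    z = np.zeros(3)                              # initial state
    for x in sequence:                           # x: one-hot word vector
        z = sigmoid(V.T @ z + W.T @ x + b)       # z[t] depends on z[t-1] and x[t]
    return sigmoid(U.T @ z + c)                  # prediction from the final state

words = [np.eye(5)[i] for i in [0, 1, 2, 3, 4]]  # "This is good news. <eos>"
print(predict(words))                            # sentiment probability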

Deep Learning – 14 / 21
PARAMETER SHARING
This way, the network can process the sentence one word at a
time and the length of the network can vary based on the length of
the sequence.
It is important to note that no matter how long the input sequence
is, the matrices W and V are the same in every time-step. This is
another example of parameter sharing.
Therefore, the number of weights in the network is independent of
the length of the input sequence.

Deep Learning – 15 / 21
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.

Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.

Examples:
Sequence-to-One: Sentiment analysis, document classification.
One-to-Sequence: Image captioning.
Sequence-to-Sequence: Language modelling, machine translation,
time-series prediction.

Deep Learning – 16 / 21
Computational Graph

Deep Learning – 17 / 21
RNNS - COMPUTATIONAL GRAPH

On the left is an abstract representation of the computational


graph for the network on the right. A loss function L measures how
far each output f is from the corresponding training target y .

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

A helpful way to think of an RNN is as multiple copies of the same


network, each passing a message to a successor.
RNNs are networks with loops, allowing information to persist.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

Things might become more clear if we unfold the architecture.


We call z[t ] the state of the system at time t.
The state contains information about the whole past sequence.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

We went from

f = τ (c + U⊤ σ(b + W⊤ x)) for the dense net, to


f [t ] = τ (c + U⊤ σ(b + V⊤ z[t −1] + W⊤ x[t ] )) for the RNN.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

A potential computational graph for time-step t with

f [t ] = τ (c + U⊤ σ(b + V⊤ z[t −1] + W⊤ x[t ] ))

Deep Learning – 18 / 21
RECURRENT OUTPUT-HIDDEN CONNECTIONS
Recurrent connections do not need to map from hidden to hidden
neurons!

RNN with feedback connection from the output to the hidden layer. The RNN is only
allowed to send f to future time points and, hence, z [t −1] is connected to z [t ] only
indirectly, via the predictions f [t −1] .

Deep Learning – 19 / 21
SEQ-TO-ONE MAPPINGS
RNNs do not need to produce an output at each time step. Often only
one output is produced after processing the whole sequence.

Time-unfolded recurrent neural network with a single output at the end of the
sequence. Such a network can be used to summarize a sequence and produce a fixed
size representation.

Deep Learning – 20 / 21
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http: // karpathy. github. io/ 2015/ 05/ 21/ rnn-effectiveness/

Deep Learning – 21 / 21
Deep Learning

Recurrent Neural Networks -


Backpropogation

Learning goals
How does Backpropagation work
for RNNs?
Exploding and Vanishing
Gradients
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Task: Learn character probability distribution from input text
Suppose we only had a vocabulary of four possible letters: “h”, “e”,
“l” and “o”
We want to train an RNN on the training sequence “hello”.
This training sequence is in fact a source of 4 separate training
examples:
The probability of “e” should be likely given the context of “h”
“l” should be likely in the context of “he”
“l” should also be likely given the context of “hel”
and “o” should be likely given the context of “hell”
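A minimal sketch of how these 4 training examples can be constructed from the string “hello” (the ordering of the vocabulary is an arbitrary choice):

import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

text = "hello"
# Each character becomes a one-hot input; the target is the next character.
inputs  = [np.eye(len(vocab))[char_to_idx[c]] for c in text[:-1]]   # h, e, l, l
targets = [char_to_idx[c] for c in text[1:]]                        # e, l, l, o

for x, y in zip(inputs, targets):
    print(x, "->", vocab[y])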

Deep Learning – 1 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 2 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 3 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 4 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
The RNN has a 4-dimensional input and output. The exemplary hidden
layer consists of 3 neurons. This diagram shows the activations in the
forward pass when the RNN is fed the characters “hell” as input. The
output contains confidences the RNN assigns for the next character.

Our goal is to increase the


confidence for the correct
letters (green digits) and
decrease the confidence of
all others (we could also
use a softmax activation to
squash the digits to
probabilities ∈ [0, 1]). How
can we now train the
network?
Backpropagation
through time!
Deep Learning – 5 / 13
BACKPROPAGATION THROUGH TIME

For training the RNN, we need to compute \frac{dL}{du_{i,j}}, \frac{dL}{dv_{i,j}}, and \frac{dL}{dw_{i,j}}.
To do so, during backpropagation at time step t for an arbitrary RNN, we need to compute
\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \ldots \frac{dz^{[2]}}{dz^{[1]}}
Deep Learning – 6 / 13
LONG-TERM DEPENDENCIES
Here, z[t] = σ(V⊤ z[t−1] + W⊤ x[t] + b).
It follows that:

\frac{dz^{[t]}}{dz^{[t-1]}} = \mathrm{diag}(\sigma'(V^\top z^{[t-1]} + W^\top x^{[t]} + b))\, V^\top = D^{[t-1]} V^\top
\frac{dz^{[t-1]}}{dz^{[t-2]}} = \mathrm{diag}(\sigma'(V^\top z^{[t-2]} + W^\top x^{[t-1]} + b))\, V^\top = D^{[t-2]} V^\top
\vdots
\frac{dz^{[2]}}{dz^{[1]}} = \mathrm{diag}(\sigma'(V^\top z^{[1]} + W^\top x^{[2]} + b))\, V^\top = D^{[1]} V^\top

\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \ldots \frac{dz^{[2]}}{dz^{[1]}} = \frac{dL}{dz^{[t]}}\, D^{[t-1]} D^{[t-2]} \ldots D^{[1]} (V^\top)^{t-1}

Deep Learning – 7 / 13
LONG-TERM DEPENDENCIES
In general, for an arbitrary time-step i < t in the past, dz[t]/dz[i] will contain the term (V⊤)^{t−i} (this follows from the chain rule).
Based on the largest eigenvalue of V⊤, the presence of the term (V⊤)^{t−i} can either result in vanishing or exploding gradients.
This problem is quite severe for RNNs (as compared to feedforward networks) because the same matrix V⊤ is multiplied several times.
As the gap between t and i increases, the instability worsens.
It is thus quite challenging for RNNs to learn long-term
dependencies. The gradients either vanish (most of the time) or
explode (rarely, but with much damage to the optimization).
That happens simply because we propagate errors over very
many stages backwards.
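A small NumPy experiment illustrating this effect (the matrix is random; only its spectral radius is controlled):

import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 3))

for scale in [0.5, 1.5]:                     # largest eigenvalue below / above 1
    M = scale * V / np.max(np.abs(np.linalg.eigvals(V)))
    grad = np.eye(3)
    for _ in range(50):                      # 50 time steps back
        grad = grad @ M.T                    # repeated multiplication with V^T
    print(scale, np.linalg.norm(grad))       # ~0 (vanishing) vs. huge (exploding)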

Deep Learning – 8 / 13
LONG-TERM DEPENDENCIES

Figure: Exploding gradients

Deep Learning – 9 / 13
LONG-TERM DEPENDENCIES
Recall, that we can counteract exploding gradients by
implementing gradient clipping.
To avoid exploding gradients, we simply clip the norm of the
gradient at some threshold h (see chapter 4):

\text{if } \lVert \nabla W \rVert > h: \quad \nabla W \leftarrow \frac{h}{\lVert \nabla W \rVert}\, \nabla W
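A minimal sketch of this rule (in PyTorch, torch.nn.utils.clip_grad_norm_ implements the same idea):

import numpy as np

def clip_gradient(grad, h):
    """Rescale the gradient if its norm exceeds the threshold h."""
    norm = np.linalg.norm(grad)
    if norm > h:
        grad = (h / norm) * grad
    return grad

g = np.array([3.0, 4.0])            # norm 5
print(clip_gradient(g, h=1.0))      # [0.6 0.8], norm 1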

Deep Learning – 10 / 13
LONG-TERM DEPENDENCIES

Figure: Vanishing gradients

Deep Learning – 11 / 13
LONG-TERM DEPENDENCIES
Even for a stable RNN (gradients not exploding), there will be
exponentially smaller weights for long-term interactions compared
to short-term ones and a more sophisticated solution is needed for
this vanishing gradient problem (discussed in the next chapters).
The vanishing gradient problem heavily depends on the choice of
the activation functions.
Sigmoid maps a real number into a “small” range (i.e. [0, 1])
and thus even huge changes in the input will only produce a
small change in the output. Hence, the gradient will be small.
This becomes even worse when we stack multiple layers.
We can avoid this problem by using activation functions which
do not “squash” the input.
The most popular choice is ReLU with gradients being either
0 or 1, i.e., they never saturate and thus don’t vanish.
The downside of this is that we can obtain a “dead” ReLU.

Deep Learning – 12 / 13
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Deep Learning – 13 / 13
Deep Learning

Modern Recurrent Neural Networks

Learning goals
LSTM cell
GRU cell
Bidirectional RNNs
Long Short-Term Memory (LSTM)

Deep Learning – 1 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.

A simple RNN mechanism;

Until now, we simply computed

z[t ] = σ(b + V> z[t −1] + W> x[t ] )

Deep Learning – 2 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.

Left: A simple RNN mechanism; Right: An LSTM cell

Until now, we simply computed

z[t ] = σ(b + V> z[t −1] + W> x[t ] )

Now we introduce the LSTM cell, a small network on its own.


Deep Learning – 2 / 15
LONG SHORT-TERM MEMORY (LSTM)

The key to LSTMs is the cell state s[t ] .


s[t ] can be manipulated by different gates to forget old information,
add new information, and read information out of it.
Each gate is a vector of the same size as s[t ] with elements
between 0 ("let nothing pass") and 1 ("let everything pass").

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Forget gate e[t ] : indicates which information of the old cell state
we should forget.
Intuition: Think of a model trying to predict the next word based on
all the previous ones. The cell state might include the gender of
the present subject, so that the correct pronouns can be used.
When we now see a new subject, we want to forget the gender of
the old one.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

We obtain the forget gate by computing

e[t] = σ(b_e + V_e⊤ z[t−1] + W_e⊤ x[t])

σ() is a sigmoid and V_e, W_e are forget-gate specific weights.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

To compute the cell state s[t], the first step is to multiply (element-wise) the previous cell state s[t−1] by the forget gate e[t]:

e[t] ⊙ s[t−1], with e[t] ∈ [0, 1]

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Input gate i[t ] : indicates which new information should be added


to s[t ] .
Intuition: In our example, this is where we add the new information
about the gender of the new subject.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

The new information is given by s̃[t] = tanh(b + V⊤ z[t−1] + W⊤ x[t]) ∈ [−1, 1].
The input gate is given by i[t] = σ(b_i + V_i⊤ z[t−1] + W_i⊤ x[t]) ∈ [0, 1].
W and V are the weights of the new information, W_i and V_i the weights of the input gate.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Now we can finally compute the cell state s[t]:

s[t] = e[t] ⊙ s[t−1] + i[t] ⊙ s̃[t]

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Output gate o[t]: indicates which information from the cell state is filtered.
It is given by o[t] = σ(b_o + V_o⊤ z[t−1] + W_o⊤ x[t]), with specific weights W_o, V_o.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Finally, the new state z[t ] of the LSTM is a function of the cell state,
multiplied by the output gate:

z[t] = o[t] ⊙ tanh(s[t])
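A minimal NumPy sketch of one LSTM step following the gate equations above (sizes and random weights are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, z_prev, s_prev, p):
    """One LSTM step: forget gate, input gate, cell state, output gate."""
    e = sigmoid(p["Ve"].T @ z_prev + p["We"].T @ x + p["be"])      # forget gate
    i = sigmoid(p["Vi"].T @ z_prev + p["Wi"].T @ x + p["bi"])      # input gate
    s_tilde = np.tanh(p["V"].T @ z_prev + p["W"].T @ x + p["b"])   # new information
    s = e * s_prev + i * s_tilde                                   # cell state
    o = sigmoid(p["Vo"].T @ z_prev + p["Wo"].T @ x + p["bo"])      # output gate
    z = o * np.tanh(s)                                             # new hidden state
    return z, s

# Illustrative sizes: 4-dimensional input, 3 hidden units.
rng = np.random.default_rng(0)
params = {n: rng.normal(size=(3, 3)) for n in ["Ve", "Vi", "V", "Vo"]}
params.update({n: rng.normal(size=(4, 3)) for n in ["We", "Wi", "W", "Wo"]})
params.update({n: np.zeros(3) for n in ["be", "bi", "b", "bo"]})

z, s = np.zeros(3), np.zeros(3)
for x in np.eye(4):                       # feed a dummy sequence of one-hot vectors
    z, s = lstm_step(x, z, s, params)
print(z)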

Deep Learning – 3 / 15
Gated Recurrent Units (GRU)

Deep Learning – 4 / 15
GATED RECURRENT UNITS (GRU)

The key distinction between regular RNNs and GRUs is that the
latter support gating of the hidden state.
Here, we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset.
These mechanisms are learned to:
avoid the vanishing/exploding gradient problem which comes
with a standard recurrent neural network.
solve the vanishing gradient problem by using an update gate
and a reset gate.
control the information that flows into (update gate) and out of
(reset gate) memory.

Deep Learning – 5 / 15
GATED RECURRENT UNITS (GRU)

Figure: Update gate in a GRU.

For a given time step t, the hidden state of the last time step is
z[t −1] . The update gate u[t ] is computed as follows:
u[t] = σ(W_u⊤ x[t] + V_u⊤ z[t−1] + b_u)

We use a sigmoid to transform input values to (0, 1).

Deep Learning – 6 / 15
GATED RECURRENT UNITS (GRU)

Figure: Reset gate in a GRU.

Similarly, the reset gate r[t] is computed as follows:

r[t] = σ(W_r⊤ x[t] + V_r⊤ z[t−1] + b_r)

Deep Learning – 7 / 15
GATED RECURRENT UNITS (GRU)

Figure: Hidden state computation in GRU. Multiplication is carried out elementwise.

z̃[t] = tanh(W_z⊤ x[t] + V_z⊤ (r[t] ⊙ z[t−1]) + b_z).
In a conventional RNN, we would have a hidden state update of the form: z[t] = tanh(W_z⊤ x[t] + V_z⊤ z[t−1] + b_z).

Deep Learning – 8 / 15
GATED RECURRENT UNITS (GRU)

Figure: Update gate in a GRU. The multiplication is carried out elementwise.

The update gate u[t] determines how much of the old state z[t−1] and the new candidate state z̃[t] is used:
z[t] = u[t] ⊙ z[t−1] + (1 − u[t]) ⊙ z̃[t].
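A minimal NumPy sketch of one GRU step following the equations above (sizes and random weights are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, z_prev, p):
    """One GRU step: update gate, reset gate, candidate state, new state."""
    u = sigmoid(p["Wu"].T @ x + p["Vu"].T @ z_prev + p["bu"])          # update gate
    r = sigmoid(p["Wr"].T @ x + p["Vr"].T @ z_prev + p["br"])          # reset gate
    z_tilde = np.tanh(p["Wz"].T @ x + p["Vz"].T @ (r * z_prev) + p["bz"])
    return u * z_prev + (1.0 - u) * z_tilde                            # new state

rng = np.random.default_rng(0)        # illustrative sizes: input dim 4, hidden dim 3
p = {**{n: rng.normal(size=(4, 3)) for n in ["Wu", "Wr", "Wz"]},
     **{n: rng.normal(size=(3, 3)) for n in ["Vu", "Vr", "Vz"]},
     **{n: np.zeros(3) for n in ["bu", "br", "bz"]}}

z = np.zeros(3)
for x in np.eye(4):                   # feed a dummy sequence of one-hot vectors
    z = gru_step(x, z, p)
print(z)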

Deep Learning – 9 / 15
GATED RECURRENT UNITS (GRU)

Figure: GRU

These designs can help us to eleminate the vanishing gradient problem


in RNNs and capture better dependencies for time series with large
time step distances. In summary, GRUs have the following two
distinguishing features:
Reset gates help capture short-term dependencies in time series.
Update gates help capture long-term dependencies in time series.

Deep Learning – 10 / 15
GRU VS LSTM

Figure: GRU vs LSTM

Deep Learning – 11 / 15
Bidirectional RNNs

Deep Learning – 12 / 15
BIDIRECTIONAL RNNS
Another generalization of the simple RNN are bidirectional RNNs.
These allow us to process sequential data depending on both past
and future inputs, e.g. an application predicting missing words,
which probably depend on both preceding and following words.
One RNN processes inputs in the forward direction from x[1] to x[T ]
computing a sequence of hidden states (z[1] , . . . , z(T ) ), another
RNN in the backward direction from x[T ] to x[1] computing hidden
states (g[T ] , . . . , g[1] )
Predictions are then based on both hidden states, which could be
concatenated.
With connections going back in time, the whole input sequence
must be known in advance to train and infer from the model.
Bidirectional RNNs are often used for the encoding of a sequence
in machine translation.

Deep Learning – 13 / 15
BIDIRECTIONAL RNNS
Computational graph of a bidirectional RNN:

Figure: A bidirectional RNN consists of a forward RNN processing inputs from


left to right and a backward RNN processing inputs backwards in time.

Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/

Deep Learning – 15 / 15
Deep Learning

Applications of RNNs

Learning goals
RNN Applications in NLP
RNN Applications in Computer
Vision
Get to know Encoder-Decoder
Architectures
RNN’s Applications

Deep Learning – 1 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.

Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.

Examples:
One-to-One : Image classification, video frame classification.

Deep Learning – 2 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Examples:
Many-to-One : Here a sequence of multiple steps as input are mapped to
a class or quantity prediction.
Example applications are: Sentiment analysis, document classification,
video classification, visual question answering.

Deep Learning – 3 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Ragrawal et al, “Visual 7W: Grounded Question Answering in Images”, CVPR
2015 Figures from Agrawal et al, copyright IEEE 2015.

Deep Learning – 4 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Examples:
One-to-Many: In this type of problem, an observation is mapped as input
to a sequence with multiple steps as an output.
Example applications are:
Image captioning: A combination of CNNs and RNNs are
used to provide a description of what exactly is happening
inside an image. CNN does the segmentation part and RNN
then uses the segmented data to recreate the description.
Video tagging: the RNNs can be used for video search where
we can do image description of a video divided into numerous
frames. Deep Learning – 5 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Show and Tell: A Neural Image Caption Generator (Oriol


Vinyals et al. 2014). A language generating RNN tries to describe in
brief the content of different images.
Deep Learning – 6 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Show and Tell: A Neural Image Caption Generator (Oriol


Vinyals et al. 2014). A language generating RNN tries to describe in
brief the content of different images.
Deep Learning – 6 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Image captioning is a fundamental task in Artificial Intelligence


which describes objects, attributes, and relationship in an image,
in a natural language form.
It has many applications such as semantic image search, bringing
visual intelligence to chatbots, or helping visually-impaired people
to see the world around them.

Deep Learning – 7 / 27
Seq-to-Seq (Type I)

Deep Learning – 8 / 27
RNNS - LANGUAGE MODELLING
In an earlier example, we built a ’sequence-to-one’ RNN model to
perform ’sentiment analysis’.
Another common task in Natural Language Processing (NLP) is
’language modelling’.
Input: word/character, encoded as a one-hot vector.
Output: probability distribution over words/characters given
previous words
τ
Y
[ 1]
P(y , . . . , y [τ ]
)= P(y [i ] |y [1] , . . . , y [i −1] )
i =1

→ given a sequence of previous characters, ask the RNN to model


the probability distribution of the next character in the sequence!

Deep Learning – 9 / 27
RNNS - LANGUAGE MODELLING
In this example, we will feed the characters in the word "hello" one
at a time to a ’seq-to-seq’ RNN.
For the sake of the visualization, the characters "h", "e", "l" and "o"
are one-hot coded as a vectors of length 4 and the output layer
only has 4 neurons, one for each character (we ignore the <eos>
token).
At each time step, the RNN has to output a probability distribution
(softmax) over the 4 possible characters that might follow the
current input.
Naturally, if the RNN has been trained on words in the English
language:
The probability of “e” should be likely, given the context of “h”.
“l” should be likely in the context of “he”.
“l” should also be likely, given the context of “hel”.
and, finally, “o” should be likely, given the context of “hell”.

Deep Learning – 10 / 27
RNNS - LANGUAGE MODELLING

The probability of “e” should be high, given the context of “h”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “l” should be high, given in the context of “he”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “l” should also be high, given in the context of


“hel”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “o” should be high, given the context of “hell”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

During training, our goal would be to increase the confidence for


the correct letters (indicated by the green arrows) and decrease
the confidence of all others.
Deep Learning – 11 / 27
WORD EMBEDDINGS

Source: Kaggle

Figure: Two-dimensional embedding space. Typically, the embedding space is much


higher dimensional.
Instead of one-hot representations of words it is standard practice
to encode each word as a dense (as opposed to sparse) vector of
fixed size that captures its underlying semantic content.
Similar words are embedded close to each other in a
lower-dimensional embedding space.

Deep Learning – 12 / 27
WORD EMBEDDINGS
The dimensionality of these embeddings is typically much smaller
than the number of words in the dictionary.
Using them gives you a "warm start" for any NLP task. It is an
easy way to incorporate prior knowledge into your model and a
rudimentary form of transfer learning.
Two very popular approaches to learn word embeddings are
word2vec by Google and GloVe by Facebook. These embeddings
are typically 100 to 1000 dimensional.
Even though these embeddings capture the meaning of each word
to an extent, they do not capture the semantics of the word in a
given context because each word has a static precomputed
representation. For example, depending on the context, the word
"bank" might refer to a financial institution or to a river bank.

Deep Learning – 13 / 27
Seq-to-Seq (Type II)

Deep Learning – 14 / 27
Encoder-Decoder Architectures

Deep Learning – 15 / 27
ENCODER-DECODER NETWORK
For many interesting applications such as question answering,
dialogue systems, or machine translation, the network needs to
map an input sequence to an output sequence of different length.
This is what an encoder-decoder (also called
sequence-to-sequence architecture) enables us to do!

Deep Learning – 16 / 27
ENCODER-DECODER NETWORK

Figure: In the first part of the network, information from the input is encoded in
the context vector, here the final hidden state, which is then passed on to
every hidden state of the decoder, which produces the target sequence.

Deep Learning – 17 / 27
ENCODER-DECODER NETWORK
An input/encoder-RNN processes the input sequence of length nx
and computes a fixed-length context vector C, usually the final
hidden state or simple function of the hidden states.
One time step after the other information from the input sequence
is processed, added to the hidden state and passed forward in
time through the recurrent connections between hidden states in
the encoder.
The context vector summarizes important information from the
input sequence, e.g. the intent of a question in a question
answering task or the meaning of a text in the case of machine
translation.
The decoder RNN uses this information to predict the output, a
sequence of length ny , which could vary from nx .

Deep Learning – 18 / 27
ENCODER-DECODER NETWORK
In machine translation, the decoder is a language model with
recurrent connections between the output at one time step and the
hidden state at the next time step as well as recurrent connections
between the hidden states:
ny
Y
P(y [1] , . . . , y [yn ] |x[1] , . . . , x[xn ] ) = p(y [t ] |C ; y [1] , . . . , y [t −1] )
t =1

with C being the context-vector.


This architecture is now jointly trained to minimize the translation
error given a source sentence.
Each conditional probability is then

p(y [t ] |y [1] , . . . , y [t −1] ; C ) = f (y [t −1] , g [t ] , C )

where f is a non-linear function, e.g. the tanh and g [t ] is the hidden


state of the decoder network.
Deep Learning – 19 / 27
More application examples

Deep Learning – 20 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Neural Machine Translation (seq2seq): Sequence to Sequence


Learning with Neural Networks (Ilya Sutskever et al. 2014). As we saw
earlier, an encoder converts a source sentence into a “meaning” vector
which is passed through a decoder to produce a translation.
Deep Learning – 21 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Neural Machine Translation (seq2seq): Sequence to Sequence


Learning with Neural Networks (Ilya Sutskever et al. 2014). As we saw
earlier, an encoder converts a source sentence into a “meaning” vector
which is passed through a decoder to produce a translation.
Deep Learning – 21 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Generating Sequences With Recurrent Neural Networks (Alex Graves,


2013). Top row are real data, the rest are generated by various RNNs.

Deep Learning – 22 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Convolutional and recurrent nets for detecting emotion from audio
data (Namrata Anand & Prateek Verma, 2016). We already had this example
in the CNN chapter!

Deep Learning – 23 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Visually Indicated Sounds (Andrew Owens et al. 2016). A model to


synthesize plausible impact sounds from silent videos. Click here

Deep Learning – 24 / 27
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.oydhub.com/attention-mechanism/

Deep Learning – 25 / 27
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-eectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271

Deep Learning – 26 / 27
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Deep Learning – 27 / 27
Deep Learning

Attention and Transformers

Learning goals
Familiarize with the most recent
sequence data modeling
technique:
Attention Mechanism
Transformers
Get to know the CNN alternative
to RNNs
Attention

Deep Learning – 1 / 22
WAHT IS ATTENTION
Humans process data by actively shifting their focus:
Different parts of an image carry different information
Words derive their specific meaning from contex
Remember specific, related events in the past
Allows to follow one thought at a time while suppressing
information irrelevant to the task
Example: cocktail party problem

Deep Learning – 2 / 22
WAHT IS ATTENTION

Figure: The encoder-decoder model for translation.


(source:https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

In a classical decoder-encoder RNN all information about the input


sequence must be incorporated into the final hidden state, which is
then passed as an input to the decoder network.
With a long input sequence this fixed-sized context vector is
unlikely to capture all relevant information about the past.
Each hidden state contains mostly information from recent inputs.

Deep Learning – 3 / 22
WAHT IS ATTENTION

Figure: The encoder-decoder model for translation.


(source:https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

Different parts of input related to different parts of output.


Encoding complete content difficult (even for LSTMs).
Issue: context vector hT provides no access to earlier inputs!

Deep Learning – 4 / 22
WAHT IS ATTENTION
Key idea: Allow the decoder to access all the hidden states of the
encoder (instead of just the final one) so that it can dynamically
decide which ones are relevant at each time-step in the decoding.
This means the decoder can choose to "focus" on different hidden
states (of the encoder) at different time-steps of the decoding
process similar to how the human eye can focus on different
regions of the visual field.
This is known as an attention mechanism.

Deep Learning – 5 / 22
WAHT IS ATTENTION
The attention mechanism is implemented by an additional
component in the decoder.
For example, this can be a simple single-hidden layer feed-forward
neural network which is trained along with the RNN.
At any given time-step i of the decoding process, the network
computes the relevance of encoder state z[j ] as:

rel (z[j ] )[i ] = v>


a tanh(Wa [g
[i −1] [j ]
; z ])
where va and Wa are the parameters of the feed-forward network,
g[i −1] is the decoder state from the previous time-step and ’;’
indicates concatenation.
The relevance scores (for all the encoder hidden states) are then
normalized which gives the attention weights (α[j ] )[i ] :

exp(rel (z[j ] )[i ] )


(α[j ] )[i ] = P [j 0 ] [i ]
j 0 exp(rel (z ) )

Deep Learning – 6 / 22
WAHT IS ATTENTION
The attention mechanism allows the decoder network to focus on
different parts of the input sequence by adding connections from
all hidden states of the encoder to each hidden state of the
decoder.

Figure: Attention at i = t + 1

Deep Learning – 7 / 22
WAHT IS ATTENTION
At each time step i, a set of weights (α[j ] )[i ] is computed which
determine how to combine the hidden states of the encoder into a
Pn
context vector g[i ] = j =x 1 (α[j ] )[i ] z[j ] , which holds the necessary
information to predict the correct output.
Each hidden state contains mostly information from recent inputs.
In the case of a bidirectional RNN to encode the input sequence, a
hidden state contains information from recent preceding and
following inputs.

Deep Learning – 8 / 22
WAHT IS ATTENTION

Figure: Attention at i = t + 2

Deep Learning – 9 / 22
WAHT IS ATTENTION

Credit: Gabriel Loye

Figure: An illustration of a machine translation task using an encoder-decoder model


with an attention mechanism. The attention weights at each time-step of the
decoding/translation process indicate which parts of the input sequence are most
relevant. (There are 4 attention weights because there are 4 encoder states.)

Deep Learning – 10 / 22
ATTENTION

Figure: Attention for image captioning: the attention mechanism tells the
network roughly which pixels to pay attention to when writing the text (Kelvin
Xu al. 2015)

Deep Learning – 11 / 22
Transformers

Deep Learning – 12 / 22
TRANSFORMERS
Advanced RNNs have similar limitations as vanilla RNN networks:
RNNs process the input data sequentially.
Difficulties in learning long term dependency (although GRU
or LSTM perform better than vanilla RNNs, they sometimes
struggle to remember the context introduced earlier in long
sequences).
These challenges are tackled by transformer networks.

Deep Learning – 13 / 22
TRANSFORMERS
Transformers are solely based on attention (no RNN or CNN).
In fact, the paper which coined the term transformer is called
Attention is all you need.
They are the state-of-the-art networks in natural language
processing (NLP) tasks since 2017.
Transformer architectures like BERT (Bidirectional Encoder
Representations from Transformers, 2018) and GPT-3 (Generative
Pre-trained Transformer-3, 2020) are pre-trained on a large corpus
and can be fine-tuned to specific language tasks.

Deep Learning – 14 / 22
TRANSFORMERS

Deep Learning – 15 / 22
CNNs or RNNs?

Deep Learning – 16 / 22
CNNS OR RNNS?
Historically, RNNs were the default for sequence processing tasks.
However, some families of CNNs (especially those based on Fully
Convolutional Networks (FCNs)) can be used to process
variable-length sequences such as text or time-series data.
If a CNN doesn’t contain any fully-connected layers, the total
number of weights in the network is independent of the spatial
dimensions of the input because of weight-sharing in the
convolutional layers.
Recent research [Bai et al. , 2018] indicates that such
convolutional architectures, so-called Temporal Convolutional
Networks (TCNs), can outperform RNNs on a wide range of tasks.
A major advantage of TCNs is that the entire input sequence can
be fed to the network at once (as opposed to sequentially).

Deep Learning – 17 / 22
CNNS OR RNNS?

Figure: A TCN (we have already seen this in the CNN lecture!) is simply a variant of
the one-dimensional FCN which uses a special type of dilated convolutions called
causal dilated convolutions.

Deep Learning – 18 / 22
SUMMARY
RNNs are specifically designed to process sequences of varying
lengths.
For that recurrent connections are introduced into the network
structure.
The gradient is calculated by backpropagation through time.
An LSTM replaces the simple hidden neuron by a complex system
consisting of cell state, and forget, input, and output gates.
An RNN can be used as a language model, which can be
improved by word-embeddings.
Different advanced types of RNNs exist, like Encoder-Decoder
architectures and bidirectional RNNs.1

1. A bidirectional RNN processes the input sequence in both directions (front-to-back and back-to-front).

Deep Learning – 19 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.oydhub.com/attention-mechanism/

Deep Learning – 20 / 22
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-eectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271

Deep Learning – 21 / 22
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Deep Learning – 22 / 22
Deep Learning

Unsupervised Learning

Learning goals
Unsupervised learning tasks
Unsupervised deep learning
UNSUPERVISED LEARNING

So far, we have described the application of neural networks to


supervised learning in which we have labeled training data
(x (1) , y (1) ), . . . , (x (n) , y (n) ).
In supervised learning scenarios we exploit label information (i.e.
class memberships or numeric values) to train our algorithm.
The model learns a function to map x to y .
Examples are: classification, regression, object detection,
semantic segmentation, image captioning, etc.

Deep Learning – 1 / 8
UNSUPERVISED LEARNING
In unsupervised learning scenarios training data consists of
unlabeled input points x (1) , . . . , x (n) .
Our goal is to learn some underlying hidden structure of the data.
Examples are: clustering, dimensionality reduction, feature
learning, density estimation, etc.

Deep Learning – 2 / 8
UNSUPERVISED LEARNING - EXAMPLES

1. Clustering.

Figure: Cluster analysis results for different algorithms. Different clusters are
indicated by different colors. (Source : Wikipedia)

Deep Learning – 3 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for visualisation in a low dimensional space.

Figure: Principal Component Analysis (PCA)

Deep Learning – 4 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for image compression.

Figure: from https://de.slideshare.net/hcycon/bildkompression

Deep Learning – 5 / 8
UNSUPERVISED LEARNING - EXAMPLES

3. Feature extraction/representation learning.

Figure: Source: Wikipedia

E.g. for semi-supervised learning: features learned from an


unlabeled dataset are employed to improve performance in a
supervised setting.

Deep Learning – 6 / 8
UNSUPERVISED LEARNING - EXAMPLES
4. Density fitting/learning a generative model.

Figure: A generative model can reconstruct the missing portions of the


images. (Bornschein, Shabanian, Fischer & Bengio, ICML, 2016)

Deep Learning – 7 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,

Deep Learning – 8 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,
a generative model, i.e. a probabilistic model of the data
generating distribution pdata (data generation, outlier detection,
missing feature extraction, reconstruction, denoising or planning in
reinforcement learning, ...).

Deep Learning – 8 / 8
Deep Learning

Autoencoders - Basic Principle

Learning goals
Task and structure of an AE
Undercomplete AEs
Relation of AEs and PCA
AUTOENCODER-TASK AND STRUCTURE

Autoencoders (AEs) are NNs for unsupervised learning of a lower


dimensional feature representation from unlabeled training data.
Task: Learn a compression of the data.
Autoencoders consist of two parts:
encoder learns mapping from the data x to a
low-dimensional latent variable z = enc (x).
decoder learns mapping back from latent z to a
reconstruction x̂ = dec (z) of x.
Loss function does not use any labels and measures the quality of
the reconstruction compared to the input:

L (x, dec (enc (x)))

Goal: Learn good representation z (also called code).

Deep Learning – 1 / 15
AUTOENCODER (AE)- COMPUTATIONAL GRAPH
The general structure of an AE as a computational graph:

An AE has two computational steps:


the encoder enc, mapping x to z.
the decoder dec, mapping z to x̂.

Deep Learning – 2 / 15
Undercomplete Autoencoders

Deep Learning – 3 / 15
UNDERCOMPLETE AUTOENCODERS
A naive implementation of an autoencoder would simply learn the
identity dec (enc (x)) = ^
x.
This would not be useful.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS

A naive implementation of an autoencoder would simply learn the


identity dec (enc (x)) = ^
x.
This would not be useful.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
Therefore we have a “bottleneck” layer: We restrict the
architecture, such that
dim(z) < dim(x)
Such an AE is called undercomplete.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
In an undercomplete AE, the hidden layer has fewer neurons than
the input layer.
→ That will force the AE to
capture only the most salient features of the training data!
learn a “compressed” representation of the input.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS

Training an AE is done by minimizing the risk with a loss function


penalizing the reconstruction dec (enc (x)) for differing from x.
The L2-loss
kx − dec (enc (x))k22
is a typical choice, but other loss functions are possible.
For optimization, the same optimization techniques as for standard
feed-forward nets are applied (SGD, RMSProp, ADAM,...).

Deep Learning – 5 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Let us try to compress the MNIST data as good as possible.


We train undercomplete AEs with different dimensions of the
internal representation z (.i.e. different “bottleneck” sizes).

Figure: Flow chart of our our autoencoder: reconstruct the input with fixed
dimensions dim(z) ≤ dim(x).

Deep Learning – 6 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: Architecture of the autoencoder.

Deep Learning – 7 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 784 = dim(x).

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 256.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 64.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 32.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 16.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 8.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 4.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 2.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 1.

Deep Learning – 8 / 15
INCREASING THE CAPACTIY OF AES
Increasing the number of layers adds capacity to autoencoders:

Deep Learning – 9 / 15
Autoencoders as Principal Component
Analysis

Deep Learning – 10 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS

Consider a undercomplete autoencoder with


linear encoder function enc (x), and
linear decoder function dec (z ).
The L2-loss kx − dec (enc (x))k22 is employed and inputs are
normalized to zero mean.
We want to find the linear projection of the data with the minimal
L2-reconstruction error.

Deep Learning – 11 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
It can be shown that the optimal solution is an orthogonal linear
transformation (i.e. a rotation of the coordinate system) given by
the dim(z ) = k singular vectors with largest singular values.

Deep Learning – 12 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
This is an equivalent formulation to Principal Component
Analysis (PCA), which uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal
components.
The transformation is defined in such a way that the first principal
component has the largest possible variance (i.e., accounts for as
much of the variability in the data as possible).

Deep Learning – 13 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
The formulations are equivalent: “Find a linear projection into a
k -dimensional space that ...”
“... minimizes the L2-reconstruction error” (AE-based
formulation).
“... maximizes the variance of the projected datapoints”
(statistical formulation).
An AE with a non-linear decoder/encoder can be seen as a
non-linear generalization of PCA.

Figure: Credits: Jeremy Jordan “Introduction to autoencoders”

Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/

Deep Learning – 15 / 15
Deep Learning

Regularized Autoencoders

Learning goals
Overcomplete AEs
Sparse AEs
Denoising AEs
Contractive AEs
Overcomplete Autoencoders

Deep Learning – 1 / 20
OVERCOMPLETE AE – PROBLEM
Overcomplete AE (code dimension ≥ input dimension): even a linear
AE can copy the input to the output without learning anything useful.
How can an overcomplete AE be useful?

Figure: Overcomplete AE that learned to copy its inputs to the hidden layer
and then to the output layer (Credits to M. Ponti).

Deep Learning – 2 / 20
REGULARIZED AUTOENCODER

Goal: choose code dimension and capacity of encoder/decoder


based on the problem.
Regularized AEs modify the original loss function to:
prevent the network from trivially copying the inputs.
encourage additional properties.
Examples:
Sparse AE: sparsity of the representation.
Denoising AE: robustness to noise.
Contractive AE: small derivatives of the representation
w.r.t. input.
⇒ A regularized AE can be overcomplete and nonlinear but still learn
something useful about the data distribution!

Deep Learning – 3 / 20
Sparse Autoencoder

Deep Learning – 4 / 20
SPARSE AUTOENCODER

Idea: Regularization with a sparsity constraint

L(x, dec (enc (x))) + λkz k1

Try to keep the number of active neurons per training input low.
Forces the model to respond to unique statistical features of the
input data.

Figure: Sparse Autoencoder (Credits to M. Ponti).

Deep Learning – 5 / 20
Denoising Autoencoders

Deep Learning – 6 / 20
DENOISING AUTOENCODERS (DAE)
The denoising autoencoder (DAE) is an autoencoder that receives a
corrupted data point as input and is trained to predict the original,
uncorrupted data point as its output.
Idea: representation should be robust to introduction of noise.
Produce corrupted version x̃ of input x, e.g. by
random assignment of subset of inputs to 0.
adding Gaussian noise.
Modified reconstruction loss: L(x, dec (enc (x̃)))
→ denoising AEs must learn to undo this corruption.

Deep Learning – 7 / 20
DENOISING AUTOENCODERS (DAE)
With the corruption process, we induce stochasticity into the DAE.
Formally: let C (x̃|x) present the conditional distribution of
corrupted samples x̃, given a data sample x.
Like feedforward NNs can model a distribution over targets p(y|x),
output units and loss function of an AE can be chosen such that
one gets a stochastic decoder pdecoder (x|z).
E.g. linear output units to parametrize the mean of Gaussian
distribution for real valued x and negative log-likelihood loss (which
is equal to MSE).
The DAE then learns a reconstruction distribution preconstruct (x|x̃)
from training pairs (x, x̃).
(Note that the encoder could also be made stochastic, modelling
pencoder (z|x̃).)

Deep Learning – 8 / 20
DENOISING AUTOENCODERS (DAE)
The general structure of a DAE as a computational graph:

Figure: Denoising autoencoder: “making the learned representation robust to


partial corruption of the input pattern.”

Deep Learning – 9 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

A DAE is trained to map a corrupted data point x̃ back to the original


data point x.

Deep Learning – 10 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

The corruption process C (x̃|x) is displayed by the gray circle of


equiprobable corruptions
Training a DAE by minimizing ||dec (enc (x̃)) − x||2 corresponds to
minimizing Ex,x̃∼pdata (x)C (x̃|x) [− log pdecoder (x|f (x̃))].

Deep Learning – 11 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

The vector dec (enc (x̃)) − x̃ points approximately towards the


nearest point in the data manifold, since dec (enc (x̃)) estimates
the center of mass of clean points x which could have given rise to
x̃.
Thus, the DAE learns a vector field dec (enc (x̃)) − x indicated by
the green arrows.

Deep Learning – 12 / 20
DENOISING AUTOENCODERS (DAE)
An example of a vector field learned by a DAE.

Figure: source: Ian Goodfellow et al. (2016)

Deep Learning – 13 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

We will now corrupt the MNIST data with Gaussian noise and then
try to denoise it as good as possible.

Figure: Flow chart of our autoencoder: denoise the corrupted input.

Deep Learning – 14 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

To corrupt the input, we randomly add or subtract values from a


uniform distribution to each of the image entries.

Figure: Top row: original data, bottom row: corrupted mnist data.

Deep Learning – 15 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 1568 (overcomplete).

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 784 (= dim(x)).

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 256.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 64.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 32.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 16.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 8.

Deep Learning – 16 / 20
Contractive Autoencoder

Deep Learning – 17 / 20
CONTRACTIVE AUTOENCODER

Goal: For very similar inputs, the learned encoding should also be
very similar.
We can train our model in order for this to be the case by requiring
that the derivative of the hidden layer activations are small with
respect to the input.
In other words: The encoded state enc (x) should not change
much for small changes in the input.
Add explicit regularization term to the reconstruction loss:
∂ enc (x) 2
L(x, dec (enc (x)) + λk ∂ x kF

Deep Learning – 18 / 20
DAE VS. CAE

DAE CAE
the decoder function is trained the encoder function is trained
to resist infinitesimal perturba- to resist infinitesimal perturba-
tions of the input. tions of the input.

Both the denoising and contractive autoencoders perform well.


Advantage of denoising autoencoder: simpler to implement
requires adding one or two lines of code to regular AE.
no need to compute Jacobian of hidden layer.
Advantage of contractive autoencoder: gradient is deterministic
can use second order optimizers (conjugate gradient, LBFGS,
etc.).
might be more stable than the denoising autoencoder, which
uses a sampled gradient.

Deep Learning – 19 / 20
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Everything you wanted to know about Deep Learning for Computer Vision but
were afraid to ask (2017)
SIBGRAPI Tutorials 2017

Deep Learning – 20 / 20
Deep Learning

Specific Autoencoders and Applications

Learning goals
convolutional AEs
applications of AEs
CONVOLUTIONAL AUTOENCODER (CONVAE)

For the image domain, using convolutions is advantageous. Can


we also make use of them in AEs?
In a ConvAE, the encoder consists of convolutional layers. The
decoder, on the other hand, consists of transpose convolution
layers or simple upsampling operations.

Deep Learning – 1 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)

Figure: Potential architecture of a convolutional autoencoder.

We now apply this architecture to denoise MNIST.

Deep Learning – 2 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)

Figure: Top row: noised data, second row: AE with dim(z ) = 32 (roughly 50k
params), third row: ConvAE (roughly 25k params), fourth row: ground truth.

Deep Learning – 3 / 6
REAL-WORLD APPLICATIONS
Today, autoencoders are still used for tasks such as:
data de-noising,
compression,
and dimensionality reduction for the purpose of visualization.

Deep Learning – 4 / 6
REAL-WORLD APPLICATIONS
Medical image denoising using convolutional denoising autoencoders

Figure: Top row : real image, second row : noisy version, third row : results of
a (convolutional) denoising autoencoder and fourth row : results of a median
filter (Lovedeep Gondara (2016))

Deep Learning – 5 / 6
REAL-WORLD APPLICATIONS
AE-based image compression.

Figure: from Theis et al.

Deep Learning – 6 / 6
Deep Learning

Manifold learning

Learning goals
manifold hypothesis
manifold learning with AEs
Manifold hypothesis: Data of interest lies on an embedded
non-linear manifold within the higher-dimensional space.
A manifold:
is a topological space that locally resembles the Euclidean
space.
in ML, more loosely refers to a connected set of points that
can be approximated well by considering only a small number
of dimensions.

Figure: from Goodfellow et. al

Deep Learning – 1 / 6
An important characterization of a manifold is the set of its tangent
planes.
Definition: At a point x on a d-dimensional manifold, the tangent
plane is given by d basis vectors that span the local directions of
variation allowed on the manifold.

Figure: A pictorial representation of the tangent space of a single point, x, on


a manifold (Goodfellow et al. (2016)).

Deep Learning – 2 / 6
Manifold hypothesis does not need to hold true.
In the context of AI tasks (e.g. processing images, sound, or text) it
seems to be at least approximately correct, since :
probability distributions over images, text strings, and sounds
that occur in real life are highly concentrated (randomly
sampled pixel values do not look like images, randomly
sampling letters is unlikely to result in a meaningful sentence).
samples are connected to each other by other samples, with
each sample surrounded by other highly similar samples that
can be reached by applying transformations (E.g. for images:
Dim or brighten the lights, move or rotate objects, change the
colors of objects, etc).

Deep Learning – 3 / 6
LEARNING MANIFOLDS WITH AES

AEs training procedures involve a compromise between two


forces:
1 Learning a representation z of a training example x such that
x can be approximately recovered from z through a decoder.
2 Satisfying an architectural constraint or regularization penalty.
Together, they force the hidden units to capture information about
the structure of the data generating distribution
important principle: AEs can afford to represent only the variations
that are needed to reconstruct training examples.
If the data-generating distribution concentrates near a
low-dimensional manifold, this yields representations that implicitly
capture a local coordinate system for the manifold.

Deep Learning – 4 / 6
LEARNING MANIFOLDS WITH AES
Only the variations tangent to the manifold around x need to
correspond to changes in z = enc (x). Hence the encoder learns a
mapping from the input space to a representation space that is
only sensitive to changes along the manifold directions, but that is
insensitive to changes orthogonal to the manifold.

Figure: from Goodfellow et al. (2016)

Deep Learning – 5 / 6
LEARNING MANIFOLDS WITH AES
Common setting: a representation (embedding) for the points on
the manifold is learned.
Two different approaches
1 Non-parametric methods: learn an embedding for each
training example.
2 Learning a more general mapping for any point in the input
space.
AI problems can have very complicated structures that can be
difficult to capture from only local interpolation.
⇒ Motivates use of distributed representations and deep
learning!

Deep Learning – 6 / 6
Deep Learning

Introduction to Generative Models

Learning goals
learning a generative model
examples of generative models
WHICH FACE IS FAKE?

Deep Learning – 1 / 10
DEEP UNSUPERVISED LEARNING
There are two main goals of deep unsupervised learning:
Representation Learning
Examples are: manifold learning, feature learning, etc.
Can be done by an autoencoder
Examples of applications:
dimensionality reduction / data compression
transfer learning / semi-supervised learning
Generative Models
Given a training set D = (x(1) , . . . , x(n) ) where each
x(i ) ∼ Px , the goal is to estimate Px .
Goal: Take as input training samples from some distribution
and learn a model that represents that distribution!
Examples of applications:
generating music, videos, volumetric models for 3D
printing, synthetic data for learning algorithms, outlier
identification, images denoising, inpainting, etc.
Deep Learning – 2 / 10
DENSITY FITTING / LEARNING A GENERATIVE
MODEL
Given D = x(1) , x(2) , . . . , x(n) ∼ Px learn a model of Px (for


example, fitting a Gaussian distribution via Maximum Likelihood


Estimation).

Deep Learning – 3 / 10
DENSITY FITTING / LEARNING A GENERATIVE
MODEL
Given D = x(1) , x(2) , . . . , x(n) ∼ Px learn a model of Px (for


example, fitting a Gaussian distribution via Maximum Likelihood


Estimation).

Deep Learning – 3 / 10
WHY GENERATIVE MODELS?
Generative model are capable of uncovering underlying latent variables
in a dataset and can be used for
sampling / data generation
outlier detection
missing feature extraction
image denoising / reconstruction
representation learning
planning in reinforcement learning
...

Deep Learning – 4 / 10
APPLICATION EXAMPLE: IMAGE GENERATION

Source: Karras et al. (2018)

Figure: Synthetic faces generated by a Generative Adversarial Network (more


on this later).

Deep Learning – 5 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: IMAGE INPAINTING

Source: Demir et al (2018)

Figure: A generative model fills in the missing portion of the image based on
the surrounding context.

Deep Learning – 7 / 10
APPLICATION EXAMPLE: SEMANTIC LABELS –>
IMAGES

Source: Wang et al (2017)

Deep Learning – 8 / 10
APPLICATION EXAMPLE: GENERATING IMAGES
FROM TEXT

Source: Zhang et al (2017)

Deep Learning – 9 / 10
REFERENCES
Ugur Demir, Gozde Unal (2018)
Patch-Based Image Inpainting with Generative Adversarial Networks
https: // arxiv. org/ abs/ 1803. 07422
Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen (2018)
Progressive Growing of GANs for Improved Quality, Stability, and Variation
https: // arxiv. org/ abs/ 1710. 10196
Leon A. Gatys et al. (2015)
Neural Algorithm of Artistic Style
https: // arxiv. org/ abs/ 1508. 06576

Deep Learning – 10 / 10
Deep Learning

Probabilistic graphical models

Learning goals
probabilistic graphical models
latent variables
directed graphical models
Probabilistic graphical models

Deep Learning – 1 / 10
GRAPHICAL MODELS

Probabilistic graphical models describe probability


distributions by mapping conditional dependence and
independence properties between random variables on a graph
structure.

Deep Learning – 2 / 10
WHY AGAIN GRAPHICAL MODELS?
1 Graphical models visualize the
structure of a probabilistic model; they
help to develop, understand and
motivate probabilistic models.

Deep Learning – 3 / 10
WHY AGAIN GRAPHICAL MODELS?
1 Graphical models visualize the
structure of a probabilistic model; they
help to develop, understand and
motivate probabilistic models.

2 Complex computations (e.g.,


marginalization) can derived efficiently
using algorithms exploiting the graph
structure.

Deep Learning – 3 / 10
GRAPHICAL MODELS: EXAMPLE

Credit: Daphne Koller

Figure: A graphical model representing five variables and their (in-)dependencies


along with the corresponding marginal and conditional distributions. The variable
’Grade’, for example, is affected by ’Difficulty’ (of the exam) and ’Intelligence’ (of the
student). This is captured in the corresponding conditional distribution. ’Letter’ refers to
a letter of recommendation. In this model, ’Letter’ is conditionally independent of
’Difficulty’ and ’Intelligence’, given ’Grade’.
Deep Learning – 4 / 10
Latent Variables

Deep Learning – 5 / 10
LATENT VARIABLES: MOTIVATION

Figure: A simple illustration of the relevance of latent variables. Here, six 200 x 200
pixel images are shown where each pixel is either black or white. Naively, the
probability distribution over the space of all such images would need 240000 − 1
parameters to fully specify. However, we see that the images have three main factors of
variation : object type (shape), position and size. This suggests that the actual number
of parameters required might be significantly fewer.

Deep Learning – 6 / 10
LATENT VARIABLES

Additional nodes, which do not directly correspond to


observations, allow to describe complex distributions over the
visible variables by means of simple conditional distributions.
The corresponding random variables are called hidden or latent
variables.

Figure: ’Object’, ’position’ and ’size’ are the latent variables behind an image.

Deep Learning – 7 / 10
Directed generative models

Deep Learning – 8 / 10
DIRECTED GENERATIVE MODELS
Goal: Learn to generate x from some latent variables z
Z Z
pθ (x) = pθ (x, z)dz = pθ (x|z)pθ (z)dz

Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape
Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007

Figure: Left: An illustration of a directed generative model. Right: A mapping


(represented by g) from the 2D latent space to the 3D space of observed variables.

Deep Learning – 9 / 10
DIRECTED GENERATIVE MODELS
The latent variables z must be learned from the data (which only
contains the observed variables x).
pθ (x|z)pθ (z)
The posterior is given by pθ (z|x) = pθ (x)
.
R
But pθ (x) = pθ (x|z)pθ (z)dz is intractable and common
algorithms (such as Expectation Maximization) cannot be used.

Deep Learning – 10 / 10
DIRECTED GENERATIVE MODELS
The latent variables z must be learned from the data (which only
contains the observed variables x).
pθ (x|z)pθ (z)
The posterior is given by pθ (z|x) = pθ (x)
.
R
But pθ (x) = pθ (x|z)pθ (z)dz is intractable and common
algorithms (such as Expectation Maximization) cannot be used.
The classic DAG problem: How do we efficiently learn pθ (z|x)?

Popular approaches to this problem:


Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs)
Deep Learning – 10 / 10
Deep Learning

Introduction to Generative Adversarial


Networks (GANs)

Learning goals
architecture of a GAN
minimax loss
training a GANN
WHAT IS A GAN?

A generative adversarial network (GAN) consists of two DNNs:


generator
discriminator
Generator transforms random noise vector into fake sample.
Discriminator gets real and fake samples as input and outputs
probability of the input being real.

Deep Learning – 1 / 39
WHAT IS A GAN?

Goal of generator: fool discriminator into thinking that the


synthesized samples are real.
Goal of discriminator: recognize real samples and not being fooled
by generator.
This sets off an arms race. As the generator gets better at
producing realistic samples, the discriminator is forced to get better
at detecting the fake samples which in turn forces the generator to
get even better at producing realistic samples and so on.

Deep Learning – 2 / 39
FAKE CURRENCY ILLUSTRATION
The generative model can be thought of as analogous to a team of
counterfeiters, trying to produce fake currency and use it without
detection, while the discriminative model is analogous to the police,
trying to detect the counterfeit currency. Competition in this game
drives both teams to improve their methods until the counterfeits are
indistinguishable from the genuine articles.
-Ian Goodfellow

Image created by Mayank Vadsola

Deep Learning – 3 / 39
GAN Training

Deep Learning – 4 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

pdata(x) is our target, the data distribution.

The generator is a neural network mapping a latend random vector


z to generated sample G(z). Even if the generator is a
determinisic function, we have random outputs, i.e. variability.

p(z) is usually a uniform distribution or an isotropic Gaussian. It is


typically fixed and not adapted during training.

Deep Learning – 5 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

G(z) is the output of the generator for a given state z of the latent
variables.

D (x) is the output of the discriminator for a real sample x.

D (G(z)) is the output of the discriminator for a fake sample G(z)


synthesized by the generator.

Deep Learning – 6 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

Ex∼pdata(x) [log D (x)] is the log-probability of correctly classifying


real data points as real.

Ez∼p(z) [log(1 − D (G(z)))] is the log-probability of correctly


classifying fake samples as fake.

With each gradient update, the discriminator tries to push D (x)


toward 1 and D (G(z))) toward 0. This is the same as maximizing
V(D,G).

The generator only has control over D (G(z)) and tries to push that
toward 1 with each gradient update. This is the same as
minimizing V(D,G).

Deep Learning – 7 / 39
GAN TRAINING : PSEUDOCODE

Algorithm 1 Minibatch stochastic gradient descent training of GANs.


Amount of training iterations, amount of discriminator updates k
1: for number of training iterations do
2: for k steps do
3: Sample minibatch of m samples {z(1) . . . z(m) } from prior pg (z)
4: Sample minibatch of m examples {x(1) . . . x(m) } from training data
5: Update discriminator by ascending the stochastic gradient:
m 
log D (x(i ) ) + log(1 − D (G(z(i ) )))

∇θd m1
P
i =1
6: end for
7: Sample minibatch of m noise samples {z(1) . . . z(m) } from the noise prior pg (z)
8: Update generator by descending the stochastic gradient:
m
∇θg m1 log(1 − D (G(z(i ) )))
P
i =1
9: end for

Deep Learning – 8 / 39
GAN TRAINING: ILLUSTRATION

GANs are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between
samples from the data generating distribution (black,dotted line) px from those of the generative distribution pg (G) (green, solid
line).Source: Goodfellow et al (2017),

For k steps, G’s parameters are frozen and one performs gradient
ascent on D to increase its accuracy.
Finally, D’s parameters are frozen and one performs gradient
descent on G to increase its generation performance.
Note, that G gets to peek at D’s internals (from the
back-propagated errors) but D does not get to peek at G.

Deep Learning – 9 / 39
DIVERGENCE MEASURES

The goal of generative modeling is to learn pdata (x).

The differences between different generative models can be


measured in terms of divergence measures.

A divergence measure quantifies the distance between two


distributions.

There are many different divergence measures that one can us


(e.g. Kullback-Leibler divergence).

All such measures always positive and 0 if and only if the two
distributions are equal to each other.

Deep Learning – 10 / 39
DIVERGENCE MEASURES

One approach to training generative models is to explicitly minimize the


distance between pdata (x) and the model distribution pθ (x) according to
some divergence measure.

If our generator has the capacity to model pdata (x) perfectly, the choice of
divergence does not matter much because they all achieve their
minimum (that is 0) when pg (x) = pdata (x).

However, it is not likely that that the generator, which is parametrized by


the weights of a neural network, is capable of perfectly modelling an
arbitrary pdata (x).

In such a scenario, the choice of divergence measure matters, because


the parameters that miniminize the various divergence measures differ.

Deep Learning – 11 / 39
IMPLICIT DIVERGENCE MEASURE OF GANS

GANs do not explicitly minimize any divergence measure.


However, (under some assumptions!) optimizing the minimax loss
is equivalent to implicitly minimizing a divergence measure.
That is, if the optimal discriminator is found in every iteration, the
generator minimizes the Jensen-Shannon divergence (JSD)
(theorem and proof are given by the original GAN paper
(Goodfellow et al, 2014)):

1 pdata + pg 1 pdata + pg
JS (pdata ||pg ) = KL(pdata || ) + KL(pg || )
2 2 2 2
pdata (x)
KL(pdata ||pg ) = Ex∼pdata (x) [log ]
pg (x)

Deep Learning – 12 / 39
OPTIMAL DISCRIMINATOR
∗ is:
For G fixed, the optimal discriminator DG

Credit: Mark Chang

The optimal discriminator returns a value greater than 0.5 if the


probability to come from the data (pdata (x )) is larger than the
probability to come from the generator (pg (x )).

Deep Learning – 13 / 39
OPTIMAL DISCRIMINATOR
∗ is:
For G fixed, the optimal discriminator DG

Credit: Mark Chang

Note: The optimal solution is almost never found in practice, since


the discriminator has a finite capacity and is trained on a finite
amount of data.
Therefore, the assumption needed to guarantee that the generator
minimizes the JSD usually does not hold in practice.
Deep Learning – 14 / 39
Challenges for GAN Optimization

Deep Learning – 15 / 39
ADVERSARIAL TRAINING

Deep Learning models (in general) involve a single player!


The player tries to maximize its reward (minimize its loss).
We use SGD (with backprop) to find the optimal parameters.
SGD has convergence guarantees (under certain conditions).
However, with non-convexity, we might only converge to a local minimum!

GANs instead involve two players


Discriminator is trying to maximize its reward,
Generator is trying to minimize discriminator’s reward.
SGD was not designed to find the Nash equilibrium of a game!
Therefore, we might not converge to the Nash equilibrium at all!

Deep Learning – 16 / 39
ADVERSARIAL TRAINING - EXAMPLE

Consider the function f (x , y ) = xy , where x and y are both


scalars.
Player A can control x and Player B can control y .
The loss:
Player A: LA (x , y ) = xy
Player B: LB (x , y ) = −xy
This can be rewritten as L(x, y) = \min_x \max_y \, xy
What we have here is a simple zero-sum game with its
characteristic minimax loss.
Deep Learning – 17 / 39
POSSIBLE BEHAVIOUR #1: CONVERGENCE

The partial derivatives of the losses are:

\frac{\partial L_A}{\partial x} = y, \qquad \frac{\partial L_B}{\partial y} = -x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.

Deep Learning – 18 / 39
POSSIBLE BEHAVIOUR #1: CONVERGENCE

In order for simultaneous gradient descent to converge to a fixed


point, both gradients have to be simultaneously 0.
They are both (simultaneously) zero only for the point (0,0).
This is a saddle point of the function f (x , y ) = xy .
The fixed point for a minimax game is typically a saddle point.
Such a fixed point is an example of a Nash equilibrium.
In adversarial training, convergence to a fixed point is not
guaranteed.

Deep Learning – 19 / 39
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR

Credit: Lilian Weng

Figure: A simulation of our example for updating x to minimize xy and updating y to


minimize -xy. The learning rate α = 0.1. With more iterations, the oscillation grows
more and more unstable.

Once x and y have different signs, every following gradient update


causes huge oscillations, and the instability gets worse over time, as
shown in the figure.

Deep Learning – 20 / 39
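The behaviour in the figure can be reproduced in a few lines. The sketch below simulates the simultaneous updates from the previous slide; α = 0.1 and the starting point are the only assumptions.

```python
# Simultaneous gradient descent on the zero-sum game f(x, y) = x * y.
alpha = 0.1
x, y = 1.0, 1.0                     # assumed starting point

for step in range(1, 61):
    grad_x = y                      # dL_A/dx for L_A(x, y) = x * y
    grad_y = -x                     # dL_B/dy for L_B(x, y) = -x * y
    x, y = x - alpha * grad_x, y - alpha * grad_y   # simultaneous update of both players
    if step % 10 == 0:
        radius = (x ** 2 + y ** 2) ** 0.5
        # The radius grows over time: the iterates spiral outwards instead of converging to (0, 0).
        print(step, round(x, 3), round(y, 3), round(radius, 3))
```

Each update multiplies the distance to the origin by sqrt(1 + α²), which is why the oscillation keeps growing.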
POSSIBLE BEHAVIOUR #3: CYCLES

Credit: Goodfellow

Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.

A discrete example: A never-ending game of Rock-Paper-Scissors


where player A chooses ’Rock’ → player B chooses ’Paper’ → A
chooses ’Scissors’ → B chooses ’Rock’ → ...
Takeaway: Adversarial training is highly unpredictable. It can get
stuck in cycles or become chaotic.

Deep Learning – 21 / 39
NON-STATIONARY LOSS SURFACE

From the perspective of one of the players, the loss surface


changes every time the other player makes a move.

This is in stark contrast to (full batch) gradient descent where the


loss surface is stationary no matter how many iterations of gradient
descent are performed.

Deep Learning – 22 / 39
ILLUSTRATION OF CONVERGENCE

Credit: Mark Chang

Deep Learning – 23 / 39
ILLUSTRATION OF CONVERGENCE: FINAL STEP

Credit: Mark Chang

Such convergence is not guaranteed, however.

Deep Learning – 24 / 39
CHALLENGES FOR GAN TRAINING

Non-convergence: the model parameters oscillate, destabilize and


never converge.
Mode collapse: the generator collapses and produces only a limited
variety of samples.
Diminished gradient: the discriminator becomes so successful that the
generator's gradient vanishes and it learns nothing.
Imbalance between the generator and the discriminator, causing
overfitting.
High sensitivity to the hyperparameter selection.

Deep Learning – 25 / 39
GAN variants

Deep Learning – 26 / 39
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

It was discovered that a relatively strong discriminator could


completely dominate the generator.
When optimizing the minimax loss, as the discriminator gets good
at identifying fake images, i.e. as D (G(z)) approaches 0, the
gradient with respect to the generator parameters vanishes.
Deep Learning – 27 / 39
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

Solution: Use a non-saturating generator loss instead:

J^{(G)} = -\frac{1}{2}\, \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]
In contrast to the minimax loss, when the discriminator gets good
at identifying fake images, the magnitude of the gradient of J (G)
increases and the generator is able to learn to produce better
images in successive iterations.
Deep Learning – 28 / 39
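The effect can be seen directly from the derivative of the two loss curves with respect to s = D(G(z)). The sketch below (illustrative, not from the lecture) compares the two gradient factors as s approaches 0.

```python
# Gradient factor w.r.t. s = D(G(z)) for the saturating and the non-saturating generator loss.
import numpy as np

s = np.array([0.5, 0.1, 0.01, 0.001])       # discriminator output on fake samples

grad_saturating     = -1.0 / (1.0 - s)      # d/ds log(1 - s): stays close to -1 as s -> 0 (flat loss curve)
grad_non_saturating = -1.0 / s              # d/ds (-log s): grows like -1/s, a strong signal when s is near 0

for si, gs, gn in zip(s, grad_saturating, grad_non_saturating):
    print(f"D(G(z)) = {si:6.3f}   saturating: {gs:8.3f}   non-saturating: {gn:10.1f}")
```

Combined with the saturating sigmoid inside D, the bounded factor of the minimax loss leaves almost no learning signal for the generator, whereas the growing factor of the non-saturating loss compensates for it.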
OTHER LOSS FUNCTIONS
Various losses for GAN training with different properties have been
proposed:

Source: Lucic et al. 2016

Deep Learning – 29 / 39
ARCHITECTURE-VARIANT GANS

Motivated by the different challenges in the GAN training procedure
described above, several architecture variants have been proposed.
Understanding and improving GAN training is a very active area of
research.

Credit: hindupuravinash
Deep Learning – 30 / 39
GAN APPLICATION
What kinds of problems can GANs address?
Generation
Conditional Generation
Clustering
Semi-supervised Learning
Representation Learning
Translation
Any traditional discriminative task can be approached with
generative models

Deep Learning – 31 / 39
CONDITIONAL GANS: MOTIVATION

In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.

Deep Learning – 32 / 39
CONDITIONAL GANS: ARCHITECTURE

Credit: Guim Perarnau

In a conditional GAN, additional information in the form of a vector y
is fed to both the generator and the discriminator.
The latent vector z can then encode all variation in the data that is not
already encoded by y.
E.g., y could encode the class of a hand-written number (from 0 to
9). Then, z could encode the style of the number (size, weight,
rotation, etc.).

Deep Learning – 33 / 39
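A minimal PyTorch sketch of this idea (an illustrative assumption, not the exact architecture from the figure): the label y is one-hot encoded and concatenated to the inputs of both networks.

```python
# Minimal conditional GAN building blocks: y is concatenated to the inputs of G and D.
import torch
import torch.nn as nn

latent_dim, n_classes, data_dim = 16, 10, 784   # illustrative sizes (e.g. flattened 28x28 digits)

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, data_dim), nn.Tanh(),
        )

    def forward(self, z, y_onehot):
        # Condition on y by concatenating it to the noise vector z.
        return self.net(torch.cat([z, y_onehot], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x, y_onehot):
        # The discriminator also sees y, so it judges "real sample of class y" vs. fake.
        return self.net(torch.cat([x, y_onehot], dim=1))

# Usage: draw five samples conditioned on class 3.
G = ConditionalGenerator()
z = torch.randn(5, latent_dim)
y = nn.functional.one_hot(torch.full((5,), 3, dtype=torch.long), n_classes).float()
fake = G(z, y)   # shape: (5, data_dim)
```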
CONDITIONAL GANS: EXAMPLE

Source: Mirza et al. 2014

Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note : z is implicit. It is not shown above.)

Deep Learning – 34 / 39
CONDITIONAL GANS: MORE EXAMPLES

Source: Isola et al. 2016

Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 35 / 39
MORE GENERATIVE MODELS

Today, we learned about two kinds of (directed) generative models:


Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs).
There are other interesting generative models, e.g.:
autoregressive models
restricted Boltzmann machines.
Note:
It is important to bear in mind that generative models are not
a solved problem.
There are many interesting hybrid models that combine two
or more of these approaches.

Deep Learning – 36 / 39
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html

Deep Learning – 37 / 39
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784

Deep Learning – 38 / 39
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them

Deep Learning – 39 / 39
Deep Learning

GAN variants

Learning goals
non-saturating loss
conditional GANs
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

It was discovered that a relatively strong discriminator could


completely dominate the generator.
When optimizing the minimax loss, as the discriminator gets good
at identifying fake images, i.e. as D (G(z)) approaches 0, the
gradient with respect to the generator parameters vanishes.
Deep Learning – 1 / 12
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

Solution: Use a non-saturating generator loss instead:

J^{(G)} = -\frac{1}{2}\, \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]
In contrast to the minimax loss, when the discriminator gets good
at identifying fake images, the magnitude of the gradient of J (G)
increases and the generator is able to learn to produce better
images in successive iterations.
Deep Learning – 2 / 12
OTHER LOSS FUNCTIONS
Various losses for GAN training with different properties have been
proposed:

Source: Lucic et al. 2016

Deep Learning – 3 / 12
ARCHITECTURE-VARIANT GANS
Motivated by the different challenges in the GAN training procedure
described above, several architecture variants have been proposed.
Understanding and improving GAN training is a very active area of
research.

Credit: hindupuravinash

Deep Learning – 4 / 12
CONDITIONAL GANS: MOTIVATION

In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.

Deep Learning – 5 / 12
CONDITIONAL GANS: ARCHITECTURE

Credit: Guim Perarnau

In a conditional GAN, additional information in the form of a vector y
is fed to both the generator and the discriminator.
The latent vector z can then encode all variation in the data that is not
already encoded by y.
E.g. y could encode the class of a hand-written number (from 0 to
9). Then, z could encode the style of the number (size, weight,
rotation, etc).

Deep Learning – 6 / 12
CONDITIONAL GANS: EXAMPLE

Source: Mirza et al. 2014

Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note : z is implicit. It is not shown above.)

Deep Learning – 7 / 12
CONDITIONAL GANS: MORE EXAMPLES

Source: Isola et al. 2016

Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 8 / 12
MORE GENERATIVE MODELS

Today, we learned about one kind of (directed) generative model:
Generative Adversarial Networks (GANs).
There are other interesting generative models, e.g.:
autoregressive models
restricted Boltzmann machines.
Note:
It is important to bear in mind that generative models are not
a solved problem.
There are many interesting hybrid models that combine two
or more of these approaches.

Deep Learning – 9 / 12
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html

Deep Learning – 10 / 12
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784

Deep Learning – 11 / 12
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them

Deep Learning – 12 / 12
Deep Learning

Challenges for GAN Optimization

Learning goals
(no) convergence to a fixed point
problems of the adversarial setting
ADVERSARIAL TRAINING

Deep Learning models (in general) involve a single player!


The player tries to maximize its reward (minimize its loss).
We use SGD (with backprop) to find the optimal parameters.
SGD has convergence guarantees (under certain conditions).
However, with non-convexity, we might only converge to a local minimum!

GANs instead involve two players


Discriminator is trying to maximize its reward.
Generator is trying to minimize discriminator’s reward.
SGD was not designed to find the Nash equilibrium of a game!
Therefore, we might not converge to the Nash equilibrium at all!

Deep Learning – 1 / 11
ADVERSARIAL TRAINING - EXAMPLE

Consider the function f (x , y ) = xy , where x and y are both


scalars.
Player A can control x and Player B can control y .
The loss:
Player A: LA (x , y ) = xy
Player B: LB (x , y ) = −xy
This can be rewritten as L(x, y) = \min_x \max_y \, xy
What we have here is a simple zero-sum game with its
characteristic minimax loss.
Deep Learning – 2 / 11
POSSIBLE BEHAVIOUR #1: CONVERGENCE

The partial derivatives of the losses are:

\frac{\partial L_A}{\partial x} = y, \qquad \frac{\partial L_B}{\partial y} = -x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.

Deep Learning – 3 / 11
POSSIBLE BEHAVIOUR #1: CONVERGENCE

In order for simultaneous gradient descent to converge to a fixed


point, both gradients have to be simultaneously 0.
They are both (simultaneously) zero only for the point (0,0).
This is a saddle point of the function f (x , y ) = xy .
The fixed point for a minimax game is typically a saddle point.
Such a fixed point is an example of a Nash equilibrium.
In adversarial training, convergence to a fixed point is not
guaranteed.

Deep Learning – 4 / 11
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR

Credit: Lilian Weng

Figure: A simulation of our example for updating x to minimize xy and updating y to


minimize -xy. The learning rate α = 0.1. With more iterations, the oscillation grows
more and more unstable.

Once x and y have different signs, every following gradient update


causes huge oscillations, and the instability gets worse over time, as
shown in the figure.

Deep Learning – 5 / 11
POSSIBLE BEHAVIOUR #3: CYCLES

Credit: Goodfellow

Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.

A discrete example: A never-ending game of Rock-Paper-Scissors


where player A chooses ’Rock’ → player B chooses ’Paper’ → A
chooses ’Scissors’ → B chooses ’Rock’ → ...
Takeaway: Adversarial training is highly unpredictable. It can get
stuck in cycles or become chaotic.

Deep Learning – 6 / 11
NON-STATIONARY LOSS SURFACE

From the perspective of one of the players, the loss surface


changes every time the other player makes a move.

This is in stark contrast to (full batch) gradient descent where the


loss surface is stationary no matter how many iterations of gradient
descent are performed.

Deep Learning – 7 / 11
ILLUSTRATION OF CONVERGENCE

Credit: Mark Chang

Deep Learning – 8 / 11
ILLUSTRATION OF CONVERGENCE: FINAL STEP

Credit: Mark Chang

Such convergence is not guaranteed, however.

Deep Learning – 9 / 11
CHALLENGES FOR GAN TRAINING

Non-convergence: the model parameters oscillate, destabilize and


never converge.
Mode collapse: the generator collapses and produces only a limited
variety of samples.
Diminished gradient: the discriminator becomes so successful that the
generator's gradient vanishes and it learns nothing.
Imbalance between the generator and the discriminator, causing
overfitting.
High sensitivity to the hyperparameter selection.

Deep Learning – 10 / 11
REFERENCES
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks

Deep Learning – 11 / 11
