Deep Learning Final Sheet

A single neuron, or perceptron, performs a weighted sum of its input values plus a bias term, and applies an activation function to output a value, representing different functions depending on the choice of activation. It is the basic computational unit of neural networks, with the input features represented by nodes connecting to weights that feed into the output neuron. Adding a constant input node allows the bias term to be absorbed into the weight vector for a graphical representation of the computation.

Deep Learning

Introduction
Learning goals
Relationship of DL and ML
Concept of representation or feature learning
Use-cases and data types for DL methods

Figure: A neural network maps an input image through several layers ("some magic") to the prediction "Oh, it's a bird!" or "No bird!".
WHAT IS DEEP LEARNING

Figure: Deep learning as a subset of machine learning, which in turn is a subset of artificial intelligence.

Deep learning is a subfield of ML based on artificial neural
networks.

Deep Learning – 1 / 12
DEEP LEARNING AND NEURAL NETWORKS

Deep learning itself is not new:


Neural networks have been around since the 70s.
Deep neural networks, i.e., networks with multiple hidden
layers, are not much younger.

Why everybody is talking about deep learning now:


1 Specialized, powerful hardware allows training of huge neural
networks to push the state of the art on difficult problems.
2 Large amounts of data are available.
3 Special network architectures exist for image/text data.
4 Better optimization and regularization strategies.

Deep Learning – 2 / 12
IMAGE CLASSIFICATION WITH NEURAL
NETWORKS
“Machine learning algorithms, inspired by the brain, based on
learning multiple levels of representation/abstraction.”
Y. Bengio

Figure: A neural network maps the raw pixels x1, x2, x3, ... through several layers ("some magic") to the prediction "Oh, it's a bird!" or "No bird!".
Deep Learning – 3 / 12
POSSIBLE USE-CASES
Deep learning can be extremely valuable if the data has these
properties:

It is high dimensional.
Each single feature by itself is not very informative; only a
combination of features might be.
There is a large amount of training data.

This implies that for tabular data, deep learning is rarely the
correct model choice.

Without extensive tuning, models like random forests or gradient


boosting will outperform deep learning most of the time.
One exception is data with categorical features with many levels.

Deep Learning – 4 / 12
POSSIBLE USE-CASE: IMAGES
High Dimensional: A color image with 255 × 255 pixels and 3 color
channels already has 195,075 features.
Informative: A single pixel is not meaningful in itself.
Training Data: Depending on the application, huge amounts of data
are available.

Architecture: Convolutional Neural Networks (CNN)

Deep Learning – 5 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Alex Krizhevsky (2009)

Image classification tries to predict a single label for each image.


CIFAR-10 is a well-known dataset used for image classification. It consists of 60,000
32×32 color images containing one of 10 object classes, with 6,000 images per class.

Deep Learning – 6 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Kaiming He (2017)

Object detection: Mask R-CNN is a general framework for instance segmentation
that efficiently detects objects in an image while simultaneously generating a
high-quality segmentation mask for each instance.
Deep Learning – 7 / 12
POSSIBLE USE-CASE: IMAGES

Credit: Hyeonwoo Noh (2015)

Image segmentation partitions the image into (multiple) segments.

Deep Learning – 8 / 12
POSSIBLE USE-CASE: TEXT
High Dimensional: Each word can be a single feature (roughly 300,000
words in the German language).
Informative: A single word does not provide much context.
Training Data: Huge amounts of text data available.

Architecture: Recurrent Neural Networks (RNN)

Deep Learning – 9 / 12
POSSIBLE USE-CASE: TEXT CLASSIFICATION

Sentiment Analysis is the application of natural language processing


to systematically identify the emotional and subjective information in
texts.

Deep Learning – 10 / 12
POSSIBLE USE-CASE: TEXT

Machine translation (e.g., Google Translate): Neural machine translation
uses neural networks to predict the likelihood of a sequence of
words, typically modeling entire sentences in a single integrated model.

Deep Learning – 11 / 12
APPLICATIONS OF DEEP LEARNING: SPEECH

Speech recognition and generation (e.g., Google Assistant): A neural
network extracts features from audio data for downstream tasks, e.g., to
classify emotions in speech.

Deep Learning – 12 / 12
Deep Learning

Single Neuron / Perceptron

Learning goals
Graphical representation of a
single neuron
Affine transformations and
non-linear activation functions
Hypothesis spaces of a single
neuron
Typical loss functions
A SINGLE NEURON

Perceptron with input features x1 , x2 , ..., xp , weights w1 , w2 , ..., wp , bias term b, and
activation function τ .

The perceptron is a single artificial neuron and the basic


computational unit of neural networks.

It is a weighted sum of input values, transformed by τ :

f (x ) = τ (w1 x1 + ... + wp xp + b) = τ (wT x + b)

Deep Learning – 1 / 11
A SINGLE NEURON
Activation function τ : a single neuron represents different functions
depending on the choice of activation function.

The identity function gives us simple linear regression:

f(x) = τ(wᵀx) = wᵀx

The logistic function gives us logistic regression:

f(x) = τ(wᵀx) = 1 / (1 + exp(−wᵀx))
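To make this concrete, here is a minimal numpy sketch of a single neuron; the input values and weights below are made up for illustration and are not taken from the slides:

import numpy as np

def neuron(x, w, b, activation):
    # single neuron: affine transformation followed by an activation tau
    return activation(x @ w + b)

identity = lambda z: z                          # linear regression
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic regression

x = np.array([1.0, 2.0, -0.5])   # example input (p = 3)
w = np.array([0.3, -0.1, 0.8])   # example weights
b = 0.2                          # bias term

print(neuron(x, w, b, identity))  # linear score w^T x + b
print(neuron(x, w, b, sigmoid))   # probability in (0, 1)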

Deep Learning – 2 / 11
A SINGLE NEURON
We consider a perceptron with 3-dimensional input, i.e.
f (x) = τ (w1 x1 + w2 x2 + w3 x3 + b).
Input features x are represented by nodes in the “input layer”.

In general, a p-dimensional input vector x will be represented by p


nodes in the input layer.

Deep Learning – 3 / 11
A SINGLE NEURON
Weights w are attached to the edges from the input layer.

The bias term b is implicit here. It is often not visualized as a
separate node.

Deep Learning – 4 / 11
A SINGLE NEURON
For an explicit graphical representation, we do a simple trick:
Add a constant feature to the inputs x̃ = (1, x1 , ..., xp )T
and absorb the bias into the weight vector w̃ = (b, w1 , ..., wp ).
The graphical representation is then:

Deep Learning – 5 / 11
A SINGLE NEURON
The computation τ (w1 x1 + w2 x2 + w3 x3 + b) is represented by the
neuron in the “output layer”.

Deep Learning – 6 / 11
A SINGLE NEURON
You can picture the input vector being "fed" to neurons on the left
followed by a sequence of computations performed from left to
right. This is called a forward pass.

Deep Learning – 7 / 11
A SINGLE NEURON
A neuron performs a 2-step computation:
1 Affine Transformation: weighted sum of inputs plus bias.

2 Non-linear Activation: a non-linear transformation applied to the


weighted sum.

Deep Learning – 8 / 11
A SINGLE NEURON: HYPOTHESIS SPACE
The hypothesis space that is formed by a single neuron is

H = { f : ℝᵖ → ℝ | f(x) = τ( ∑_{j=1}^{p} w_j x_j + b ),  w ∈ ℝᵖ, b ∈ ℝ }.

If τ is the logistic sigmoid or identity function, H corresponds to
the hypothesis space of logistic or linear regression, respectively.

Figure: Left: A regression line learned by a single neuron. Right: A


decision-boundary learned by a single neuron in a binary classification task.

Deep Learning – 9 / 11
A SINGLE NEURON: OPTIMIZATION

To optimize this model, we minimize the empirical risk

R_emp = (1/n) ∑_{i=1}^{n} L(y^(i), f(x^(i))),

where L(y, f(x)) is a loss function. It compares the network's
predictions f(x) to the ground truth y.
For regression, we typically use the L2 loss (rarely L1):

L(y, f(x)) = ½ (y − f(x))²

For binary classification, we typically apply the cross-entropy loss
(also known as Bernoulli loss):

L(y, f(x)) = −(y log f(x) + (1 − y) log(1 − f(x)))
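As a small sketch, both loss functions can be written directly in numpy (the example predictions are made up):

import numpy as np

def l2_loss(y, f):
    # 1/2 (y - f(x))^2
    return 0.5 * (y - f) ** 2

def bernoulli_loss(y, f):
    # -(y log f(x) + (1 - y) log(1 - f(x))), with f in (0, 1)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

print(l2_loss(1.0, 0.8))       # regression-style loss
print(bernoulli_loss(1, 0.8))  # cross-entropy for a positive example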

Deep Learning – 10 / 11
A SINGLE NEURON: OPTIMIZATION
For a single neuron and both choices of τ the loss function is
convex.
The global optimum can be found with an iterative algorithm like
gradient descent.
A single neuron with logistic sigmoid function trained with the
Bernoulli loss yields the same result as logistic regression when
trained until convergence.
Note: In the case of regression and the L2-loss, the solution can
also be found analytically using the “normal equations”. However,
in other cases a closed-form solution is usually not available.

Deep Learning – 11 / 11
Deep Learning

XOR-Problem

Learning goals
Example problem a single
neuron can not solve but a single
hidden layer net can
EXAMPLE: XOR PROBLEM

Suppose we have four data points

X = {(0, 0)ᵀ, (0, 1)ᵀ, (1, 0)ᵀ, (1, 1)ᵀ}

The XOR gate (exclusive or) returns true, when an odd number of
inputs are true:

x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0

Can you learn the target function with a logistic regression model?

Deep Learning – 1 / 10
EXAMPLE: XOR PROBLEM
Logistic regression cannot solve this problem. In fact, any model
using simple hyperplanes for separation cannot (including a single
neuron).

A small neural net can


easily solve the problem by
transforming the space!

Deep Learning – 2 / 10
EXAMPLE: XOR PROBLEM
Consider the following model:

Figure: A neural network with two neurons in the hidden layer. The matrix W
describes the mapping from x to z, the vector u the mapping from z to y.

Deep Learning – 3 / 10
EXAMPLE: XOR PROBLEM
Let us use the ReLU σ(z) = max{0, z} as activation function and a
simple thresholding function

τ(z) = [z > 0] = 1 if z > 0, and 0 otherwise

as output transformation function. We can represent the
architecture of the model by the following equation:

f(x | θ) = f(x | W, b, u, c) = τ(uᵀ σ(Wᵀx + b) + c)
         = τ(uᵀ max{0, Wᵀx + b} + c)

So how many parameters does our model have?

In a fully connected neural net, the number of connections between
the nodes (plus the biases) equals the number of parameters:

(2 × 2) + (2 × 1) + (2 × 1) + 1 = 9
   W         b         u       c

Deep Learning – 4 / 10
EXAMPLE: XOR PROBLEM
     
Let W = [1 1; 1 1],  b = (0, −1)ᵀ,  u = (1, −2)ᵀ,  c = −0.5

    [0 0]        [0 0]            [0 −1]
X = [0 1]   XW = [1 1]   XW + B = [1  0]
    [1 0]        [1 1]            [1  0]
    [1 1]        [2 2]            [2  1]

Note: X is a (n × p) design matrix in which the rows correspond to the data points. W,
as usual, is a (p × m) matrix where each column corresponds to a single (hidden)
neuron. B is a (n × m) matrix with b duplicated along the rows.

Deep Learning – 5 / 10
EXAMPLE: XOR PROBLEM
     
With the same W, b, u, c and X as before, the ReLU activation gives

                       [0 0]
Z = max{0, XW + B}  =  [1 0]
                       [1 0]
                       [2 1]

Note that we computed all examples at once.

Deep Learning – 6 / 10
EXAMPLE: XOR PROBLEM

The input points are mapped into the transformed space:

    [0 0]
Z = [1 0]
    [1 0]
    [2 1]

which is easily separable.

Deep Learning – 8 / 10
EXAMPLE: XOR PROBLEM
In a final step we have to multiply the activated values of matrix Z
with the vector u and add the bias term c:
     
f(x | W, b, u, c) = Z u + c

    [0 0]              [−0.5]   [−0.5]
  = [1 0] [ 1]    +    [−0.5] = [ 0.5]
    [1 0] [−2]         [−0.5]   [ 0.5]
    [2 1]              [−0.5]   [−0.5]

And then apply the step function τ (z ) = [z > 0]. This solves the
XOR problem perfectly!

x1 x2 XOR = y
0 0 0
0 1 1
1 0 1
1 1 0
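The whole construction can be checked in a few lines of numpy, using exactly the weights W, b, u, c stated above:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

W = np.array([[1.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])
c = -0.5

Z = np.maximum(0.0, X @ W + b)      # hidden layer with ReLU
f = (Z @ u + c > 0).astype(int)     # thresholded output
print(Z)                            # rows: (0,0), (1,0), (1,0), (2,1)
print(f)                            # [0 1 1 0] -- matches the XOR labels
print(np.array_equal(f, y))         # True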

Deep Learning – 9 / 10
NEURAL NETWORKS : OPTIMIZATION

In this simple example we actually “guessed” the values of the


parameters for W , b, u and c.

That won’t work for more sophisticated problems!

We will learn later about iterative optimization algorithms for


automatically adapting weights and biases.

An added complication is that the loss function is no longer
convex. Therefore, a single (global) minimum might not exist.

Deep Learning – 10 / 10
Deep Learning

Single hidden layer neural networks

Learning goals
Architecture of single hidden
layer neural networks
Representation learning/
understanding the advantage of
hidden layers
Typical (non-linear) activation
functions
MOTIVATION

We introduced a graphical way of representing simple functions/models, like
logistic regression. Why is that useful?

Because individual neurons can be used as building blocks of


more complicated functions.

Networks of neurons can represent extremely complex hypothesis


spaces.

Most importantly, it allows us to define the “right” kinds of


hypothesis spaces to learn functions that are common in our
universe in a data-efficient way (see Lin, Tegmark et al. 2016).

Bernd Bischl Deep Learning – 1 / 14


MOTIVATION
Can a single neuron perform binary classification of these points?

Bernd Bischl Deep Learning – 2 / 14


MOTIVATION
As a single neuron is restricted to learning only linear decision
boundaries, its performance on the following task is quite poor:

However, the neuron can easily separate the classes if the original
features are transformed (e.g., from Cartesian to polar
coordinates):

Bernd Bischl Deep Learning – 3 / 14


MOTIVATION
Instead of classifying the data in the original representation,

we classify it in a new feature space.

Bernd Bischl Deep Learning – 4 / 14


MOTIVATION
Analogously, instead of a single neuron,

we use more complex networks.

Bernd Bischl Deep Learning – 5 / 14


REPRESENTATION LEARNING

It is very critical to feed a classifier the “right” features in order for it


to perform well.

Before deep learning took off, features for tasks like machine
vision and speech recognition were “hand-designed” by domain
experts. This step of the machine learning pipeline is called
feature engineering.

DL automates feature engineering. This is called representation


learning.

Bernd Bischl Deep Learning – 6 / 14


SINGLE HIDDEN LAYER NETWORKS
Single neurons perform a 2-step computation:
1 Affine Transformation: a weighted sum of inputs plus bias.
2 Activation: a non-linear transformation on the weighted sum.

Single hidden layer networks consist of two layers (without input


layer):
1 Hidden Layer: having a set of neurons.
2 Output Layer: having one or more output neurons.

Multiple inputs are simultaneously fed to the network.

Each neuron in the hidden layer performs a 2-step computation.

The final output of the network is then calculated by another 2-step


computation performed by the neuron in the output layer.

Bernd Bischl Deep Learning – 7 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Each neuron in the hidden layer performs an affine transformation on
the inputs:

zin^(1) = w11 x^(1) + w21 x^(2) + w31 x^(3) + b1 = 3 · (−3) + (−9) · 1 + 2 · 5 + 5 = −3
zin^(2) = w12 x^(1) + w22 x^(2) + w32 x^(3) + b2 = 11 · (−3) + (−2) · 1 + 7 · 5 + 2 = 2
zin^(3) = w13 x^(1) + w23 x^(2) + w33 x^(3) + b3 = (−6) · (−3) + 3 · 1 + (−4) · 5 − 1 = 0
zin^(4) = w14 x^(1) + w24 x^(2) + w34 x^(3) + b4 = 6 · (−3) + (−1) · 1 + 5 · 5 + 1 = 7

Bernd Bischl Deep Learning – 8 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Each hidden neuron performs a non-linear activation transformation on
the weighted sum:

zout^(i) = σ(zin^(i)) = 1 / (1 + e^(−zin^(i)))

Bernd Bischl Deep Learning – 9 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
The output neuron performs an affine transformation on its inputs:

fin = u1 zout^(1) + u2 zout^(2) + u3 zout^(3) + u4 zout^(4) + c
fin = 3 · 0.05 + (−12) · 0.88 + 8 · 0.50 + 1 · 0.99 + 6 = 0.57

Bernd Bischl Deep Learning – 10 / 14


SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
The output neuron performs a non-linear activation transformation on
the weighted sum:

fout = σ(fin) = 1 / (1 + e^(−fin))
fout = 1 / (1 + e^(−0.57)) = 0.64

Bernd Bischl Deep Learning – 11 / 14


HIDDEN LAYER: ACTIVATION FUNCTION
If the hidden layer does not have a non-linear activation, the
network can only learn linear decision boundaries.
A lot of different activation functions exist.

Bernd Bischl Deep Learning – 12 / 14


HIDDEN LAYER: ACTIVATION FUNCTION

ReLU Activation:

Currently the most popular choice is the ReLU (rectified linear


unit):
σ(v ) = max(0, v )

Bernd Bischl Deep Learning – 13 / 14


HIDDEN LAYER: ACTIVATION FUNCTION
Sigmoid Activation Function:

The sigmoid function can also be used in the hidden layer:

σ(v) = 1 / (1 + exp(−v))

Bernd Bischl Deep Learning – 14 / 14


Deep Learning

Single Hidden Layer Networks for


Multi-Class Classification

Learning goals
Neural network architectures for
multi-class classification
Softmax activation function
Softmax loss
MULTI-CLASS CLASSIFICATION

We have only considered regression and binary classification


problems so far.

How can we get a neural network to perform multiclass


classification?

Deep Learning – 1 / 6
MULTI-CLASS CLASSIFICATION
The first step is to add additional neurons to the output layer.
Each neuron in the layer will represent a specific class (number of
neurons in the output layer = number of classes).

Figure: Structure of a single hidden layer, feed-forward neural network for


g-class classification problems (bias term omitted).

Deep Learning – 2 / 6
MULTI-CLASS CLASSIFICATION

Notation:

For g-class classification, g output units:

f = (f1 , . . . , fg )

m hidden neurons z1 , . . . , zm , with

zj = σ(Wjᵀ x), j = 1, . . . , m.

Compute linear combinations of derived features z:

fin,k = Ukᵀ z,  z = (z1, . . . , zm)ᵀ,  k = 1, . . . , g

Deep Learning – 3 / 6
MULTI-CLASS CLASSIFICATION

The second step is to apply a softmax activation function to the


output layer.

This gives us a probability distribution over the g possible
classes:

fout,k = τ_k(fin) = exp(fin,k) / ∑_{k'=1}^{g} exp(fin,k')

This is the same transformation used in softmax regression!

Derivative: ∂τ(fin)/∂fin = diag(τ(fin)) − τ(fin) τ(fin)ᵀ

It is a "smooth" approximation of the argmax operation, so
τ((1, 1000, 2)ᵀ) ≈ (0, 1, 0)ᵀ (picks out the 2nd element!).
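A short numpy sketch of the softmax transformation illustrating the "smooth argmax" behaviour; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(f_in):
    # subtract the max for numerical stability; the output is unchanged
    z = np.exp(f_in - np.max(f_in))
    return z / z.sum()

print(softmax(np.array([1.0, 1000.0, 2.0])))  # ~ (0, 1, 0): picks out the 2nd element
print(softmax(np.array([1.0, 2.0, 3.0])))     # a proper probability distribution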

Deep Learning – 4 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
MULTI-CLASS CLASSIFICATION: EXAMPLE
Forward pass (Hidden: Sigmoid, Output: Softmax).

Deep Learning – 5 / 6
OPTIMIZATION: SOFTMAX LOSS

The loss function for a softmax classifier is

L(y, f(x)) = − ∑_{k=1}^{g} [y = k] log( exp(fin,k) / ∑_{k'=1}^{g} exp(fin,k') ),

where [y = k] = 1 if y = k, and 0 otherwise.

This is equivalent to the cross-entropy loss when the label vector y


is one-hot coded (e.g. y = (0, 0, 1, 0)T ).
Optimization: Again, there is no analytic solution.
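A minimal sketch of this loss for a single observation, assuming the true label is given as a class index rather than a one-hot vector:

import numpy as np

def softmax_loss(y, f_in):
    # cross-entropy loss for class index y and output-layer scores f_in
    z = np.exp(f_in - np.max(f_in))     # numerically stable softmax
    probs = z / z.sum()
    return -np.log(probs[y])

f_in = np.array([2.0, -1.0, 0.5, 0.1])  # example scores for g = 4 classes
print(softmax_loss(2, f_in))            # loss if the true class is k = 3, i.e. y = (0, 0, 1, 0)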

Deep Learning – 6 / 6
Deep Learning

MLP – Multi-Layer Feedforward Neural


Networks

Learning goals
Architectures of deep neural
networks
Deep neural networks as
chained functions
FEEDFORWARD NEURAL NETWORKS

We will now extend the model class once again, such that we allow
an arbitrary number l of hidden layers.

The general term for this model class is (multi-layer) feedforward
networks (inputs are passed through the network from left to right;
no feedback loops are allowed).

Deep Learning – 1 / 7
FEEDFORWARD NEURAL NETWORKS
We can characterize those models by the following chain structure:

f(x) = τ ∘ φ ∘ σ^(l) ∘ φ^(l) ∘ σ^(l−1) ∘ φ^(l−1) ∘ … ∘ σ^(1) ∘ φ^(1)

where σ^(i) and φ^(i) are the activation function and the weighted
sum of hidden layer i, respectively. τ and φ are the corresponding
components of the output layer.

Each hidden layer has:

an associated weight matrix W^(i), bias b^(i), and activations
z^(i) for i ∈ {1, …, l}.
z^(i) = σ^(i)(φ^(i)) = σ^(i)(W^(i)ᵀ z^(i−1) + b^(i)), where z^(0) = x.

Again, without non-linear activations in the hidden layers, the


network can only learn linear decision boundaries.
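A minimal sketch of this chain structure as a forward pass over an arbitrary number of hidden layers; the layer sizes and randomly drawn weights below are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: np.maximum(0.0, z)          # hidden activation (ReLU)
tau = lambda z: 1.0 / (1.0 + np.exp(-z))      # output activation (sigmoid)

def forward(x, weights, biases):
    # chained computation tau(phi(... sigma(phi^(1)(x)) ...))
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma(W.T @ z + b)                # z^(i) = sigma(W^(i)T z^(i-1) + b^(i))
    W, b = weights[-1], biases[-1]
    return tau(W.T @ z + b)                   # output layer

sizes = [3, 4, 4, 1]                          # p = 3 inputs, two hidden layers, one output
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(np.array([1.0, -2.0, 0.5]), weights, biases))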

Deep Learning – 2 / 7
FEEDFORWARD NEURAL NETWORKS

Figure: Structure of a deep neural network with l hidden layers (bias terms
omitted).

Deep Learning – 3 / 7
FEEDFORWARD NEURAL NETWORKS: EXAMPLE

Deep Learning – 4 / 7
WHY ADD MORE LAYERS?

Multiple layers allow for the extraction of more and more abstract
representations.
Each layer in a feed-forward neural network adds its own degree of
non-linearity to the model.

Figure: An intuitive, geometric explanation of the exponential advantage of
deeper networks (Montúfar et al., 2014).

Deep Learning – 5 / 7
DEEP NEURAL NETWORKS
Neural networks today can have hundreds of hidden layers. The greater
the number of layers, the "deeper" the network. Historically DNNs were
very challenging to train and not popular until the late ’00s for several
reasons:

The use of sigmoid activations (e.g., logistic sigmoid and tanh)


significantly slowed down training due to a phenomenon known as
“vanishing gradients”. The introduction of the ReLU activation
largely solved this problem.
Training DNNs on CPUs was too slow to be practical. Switching
over to GPUs cut down training time by more than an order of
magnitude.
When dataset sizes are small, other models (such as SVMs) and
techniques (such as feature engineering) often outperform them.

Deep Learning – 6 / 7
DEEP NEURAL NETWORKS
The availability of large datasets and novel architectures that are
capable of handling even complex tensor-shaped data (e.g. CNNs
for image data), faster hardware, and better optimization and
regularization methods made it feasible to successfully implement
deep neural networks.

An increase in depth often translates to an increase in


performance on a given task. State-of-the-art neural networks,
however, are much more sophisticated than the simple
architectures we have encountered so far.

The term "deep learning" encompasses all of these developments and


refers to the field as a whole.

Deep Learning – 7 / 7
Deep Learning

MLP – Matrix Notation

Learning goals
Compact representation of
neural network equations
Vector notation for neuron layers
Vector and matrix notation of
bias and weight parameters
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS

The input x is a column vector with dimensions p × 1.

W is a weight matrix with dimensions p × m, where m is the
number of hidden neurons:

    [ w1,1  w1,2  ···  w1,m ]
W = [ w2,1  w2,2  ···  w2,m ]
    [  ⋮     ⋮     ⋱    ⋮   ]
    [ wp,1  wp,2  ···  wp,m ]

Deep Learning – 1 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATIONS
Hidden layer:
For example, to obtain z1, we pick the first column of W,

W1 = (w1,1, w2,1, …, wp,1)ᵀ,

and compute

z1 = σ(W1ᵀ x + b1),

where b1 is the bias of the first hidden neuron and σ : ℝ → ℝ is
an activation function.

Deep Learning – 2 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION

The network has m hidden neurons z1, …, zm with

zj = σ(Wjᵀ x + bj)
zin,j = Wjᵀ x + bj
zout,j = σ(zin,j) = σ(Wjᵀ x + bj)

for j ∈ {1, …, m}.

Vectorized notation:
zin = (zin,1, …, zin,m)ᵀ = Wᵀx + b
(Note: Wᵀx = (xᵀW)ᵀ)
z = zout = σ(zin) = σ(Wᵀx + b), where the (hidden layer)
activation function σ is applied element-wise to zin.

Deep Learning – 3 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Bias term:
We sometimes omit the bias term by adding a constant
feature to the input, x̃ = (1, x1, ..., xp)ᵀ, and by absorbing the
bias vector b as an additional first row of the weight matrix W̃.

Note: For simplification purposes, we will not explicitly


represent the bias term graphically in the following. However,
the above “trick” makes it straightforward to represent it
graphically.

Deep Learning – 4 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Output layer:

For regression or binary classification: one output unit f where


fin = uT z + c , i.e. a linear combination of derived features
plus the bias term c of the output neuron, and
f (x) = fout = τ (fin ) = τ (uT z + c ) , where τ is the output
activation function.
For regression τ is the identity function.
For binary classification, τ is a sigmoid function.
Note: The purpose of the hidden-layer activation function σ is to
introduce non-linearities so that the network is able to learn
complex functions whereas the purpose of τ is merely to get the
final score to the same range as the target.

Deep Learning – 5 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
Multiple inputs:
It is possible to feed multiple inputs to a neural network
simultaneously.

The inputs x(i ) , for i ∈ {1, . . . , n}, are arranged as rows in the
design matrix X.
X is a (n × p)-matrix.

The weighted sum in the hidden layer is now computed as


XW + B, where,
W, as usual, is a (p × m) matrix, and,
B is a (n × m) matrix containing the bias vector b (duplicated)
as the rows of the matrix.

The matrix of hidden activations Z = σ(XW + B )


Z is a (n × m) matrix.

Deep Learning – 6 / 9
SINGLE HIDDEN LAYER NETWORKS: NOTATION
The final output of the network, which contains a prediction for
each input, is τ (Z u + C ), where

u is the vector of weights of the output neuron, and,


C is a (n × 1) matrix whose elements are the (scalar) bias c
of the output neuron.
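As a sketch, the whole batched forward pass reduces to two matrix products; shapes follow the notation above, and all numbers are randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(1)
n, p, m = 5, 3, 4                    # n observations, p features, m hidden neurons

X = rng.normal(size=(n, p))          # design matrix, rows = data points
W = rng.normal(size=(p, m))          # hidden-layer weights
b = rng.normal(size=m)               # hidden-layer biases (broadcast over rows, i.e. B)
u = rng.normal(size=m)               # output weights
c = 0.1                              # output bias (broadcast, i.e. C)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

Z = sigma(X @ W + b)                 # (n x m) matrix of hidden activations
f = sigma(Z @ u + c)                 # one prediction per row of X
print(f.shape)                       # (5,)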

Deep Learning – 7 / 9
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Weights (and biases) of the network.

Deep Learning – 8 / 9
SINGLE HIDDEN LAYER NETWORKS: EXAMPLE
Forward pass through the shallow neural network.

Deep Learning – 9 / 9
Deep Learning

Universal Approximation

Learning goals
Universal approximation theorem for one-hidden-layer neural networks
The pros and cons of a low approximation error

Figure: nnet fit with size=4, maxit=1e+03 (Train: mse=0.029; CV: mse.test.mean=0.055).
UNIVERSAL APPROXIMATION PROPERTY
Theorem. Let σ : R → R be a continuous, non-constant, bounded,
and monotonically increasing function. Let C ⊂ Rp be compact, and let
C(C ) denote the space of continuous functions C → R. Then, given a
function g ∈ C(C ) and an accuracy ε > 0, there exists a hidden layer
size m ∈ N and a set of coefficients Wj ∈ Rp , uj , bj ∈ R (for
j ∈ {1, . . . , m}), such that
f : C → ℝ;  f(x) = ∑_{j=1}^{m} uj · σ(Wjᵀ x + bj)

is an ε-approximation of g, that is,

‖f − g‖_∞ := max_{x ∈ C} |f(x) − g(x)| < ε.

The theorem extends trivially to multiple outputs.
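The function class in the theorem is easy to write down explicitly; the following sketch only illustrates the form ∑ uj · σ(Wjᵀx + bj) with arbitrary (random) coefficients, it does not construct an actual ε-approximation:

import numpy as np

def f(x, W, u, b, sigma=lambda v: 1.0 / (1.0 + np.exp(-v))):
    # one-hidden-layer network with linear output: sum_j u_j * sigma(W_j^T x + b_j)
    return np.sum(u * sigma(W @ x + b))

m, p = 50, 2                                   # hidden layer size and input dimension
rng = np.random.default_rng(42)
W, u, b = rng.normal(size=(m, p)), rng.normal(size=m), rng.normal(size=m)
print(f(np.array([0.3, -1.2]), W, u, b))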

Deep Learning – 1 / 14
UNIVERSAL APPROXIMATION PROPERTY
Corollary. Neural networks with a single sigmoidal hidden layer and
linear output layer are universal approximators.
This means that for a given target function g there exists a
sequence of networks (f_k)_{k∈ℕ} that converges (pointwise) to g.

Usually, as the networks come closer and closer to g, they will


need more and more hidden neurons.

A network with fixed layer sizes can only model a subspace of all
continuous functions.

The continuous functions form an infinite dimensional vector


space. Therefore arbitrarily large hidden layer sizes are needed.

Deep Learning – 2 / 14
UNIVERSAL APPROXIMATION PROPERTY
Why is universal approximation a desirable property?

Recall the definition of a Bayes optimal hypothesis f ∗ : X → Y . It


is the best possible hypothesis (model) for the given problem: it
has minimal loss averaged over the data generating distribution.

So ideally we would like the neural network (or any other learner)
to approximate the Bayes optimal hypothesis.

Usually we do not manage to learn f ∗ .

This is because we do not have enough (infinite) data. We have no


control over this, so we have to live with this limitation.

But we do have control over which model class we use.

Deep Learning – 3 / 14
UNIVERSAL APPROXIMATION PROPERTY
Universal approximation ⇒ approximation error tends to zero as
hidden layer size tends to infinity.

Positive approximation error implies that no matter how big the


data set, we cannot find the optimal model.

This bears the risk of systematic under-fitting, which can be


avoided with a universal model class.

Deep Learning – 4 / 14
UNIVERSAL APPROXIMATION PROPERTY
As we know, there are also good reasons for restricting the model
class.

This is because a flexible model class with universal approximation


ability often results in over-fitting, which is no better than
under-fitting.

Thus, “universal approximation ⇒ low approximation error”, but at


the risk of a substantial learning error.

In general, models of intermediate flexibility give the best


predictions. For neural networks this amounts to a reasonably
sized hidden layer.

Deep Learning – 5 / 14
EXAMPLE : REGRESSION/CLASSIFICATION

Let’s look at a few examples of the types of functions and


decisions boundaries learnt by neural networks (with a single
hidden layer) of various sizes.

"size" here refers to the number of neurons in the hidden layer.

The number of "iterations" in the following slides corresponds to


the number of steps of the applied iterative optimization algorithm
(stochastic gradient descent).

Deep Learning – 6 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=1; maxit=1e+03
Train: mse=0.391; CV: mse.test.mean=0.419





Deep Learning – 7 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=2; maxit=1e+03
Train: mse=0.088; CV: mse.test.mean=0.112





Deep Learning – 8 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=3; maxit=1e+03
Train: mse=0.032; CV: mse.test.mean=0.063





Deep Learning – 9 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=4; maxit=1e+03
Train: mse=0.029; CV: mse.test.mean=0.055





Deep Learning – 10 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=5; maxit=1e+03
Train: mse=0.028; CV: mse.test.mean=19.845





Deep Learning – 11 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=6; maxit=1e+03
Train: mse=0.031; CV: mse.test.mean=4.374





Deep Learning – 12 / 14
REGRESSION EX.: 1000 TRAINING ITERATIONS
nnet: size=10; maxit=1e+03
Train: mse=0.023; CV: mse.test.mean=0.698





Deep Learning – 13 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=1; maxit=500
Train: mmce=0.336; CV: mmce.test.mean=0.346



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=2; maxit=500
Train: mmce=0.426; CV: mmce.test.mean=0.412



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=3; maxit=500
Train: mmce=0.290; CV: mmce.test.mean=0.374



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=5; maxit=500
Train: mmce=0.272; CV: mmce.test.mean=0.322



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=10; maxit=500
Train: mmce=0.184; CV: mmce.test.mean=0.106



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=30; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.034



Deep Learning – 14 / 14
CLASSIFICATION: 500 TRAINING ITERATIONS
nnet: size=50; maxit=500
Train: mmce=0.000; CV: mmce.test.mean=0.026



Deep Learning – 14 / 14
Deep Learning

Brief History

Learning goals
Predecessors of modern (deep)
neural networks
History of DL as a field
A BRIEF HISTORY OF NEURAL NETWORKS

1943: The first artificial neuron, the "Threshold Logic Unit (TLU)",
was proposed by Warren McCulloch & Walter Pitts.

The model is limited to binary inputs.


It fires/outputs +1 if the input exceeds a certain threshold θ.
The weights are not adjustable, so learning could only be
achieved by changing the threshold θ.

Deep Learning – 1 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1957: The perceptron was invented by Frank Rosenblatt.

The inputs are not restricted to be binary.


The weights are adjustable and can be learned by learning
algorithms.
As for the TLU, the threshold is adjustable and decision
boundaries are linear.

Deep Learning – 2 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1960: Adaptive Linear Neuron (ADALINE) was invented by
Bernard Widrow & Ted Hoff; weights are now adjustable according
to the weighted sum of the inputs.

1965: Group method of data handling (also known as polynomial


neural networks) by Alexey Ivakhnenko. The first learning
algorithm for supervised deep feedforward multilayer perceptrons.

Deep Learning – 3 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1969: The first “AI Winter” kicked in.
Marvin Minsky & Seymour Papert proved that a perceptron
cannot solve the XOR-Problem (linear separability).
Less funding ⇒ Standstill in AI/DL research.

1985: Multilayer perceptron with backpropagation by David


Rumelhart, Geoffrey Hinton, and Ronald Williams.
Efficiently compute derivatives of composite functions.
Backpropagation was developed already in 1970 by
Linnainmaa.
Deep Learning – 4 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
1985: The second “AI Winter” kicked in.
Overly optimistic expectations concerning potential of AI/DL.
The phrase “AI” had even reached pseudoscience status.
Kernel machines and graphical models both achieved good results on
many important tasks.
Some fundamental mathematical difficulties in modeling long sequences
were identified.

Credit: https://emerj.com/ai-executive-guides/will-there-be-another-artificial-intelligence-winter-probably-not/

Deep Learning – 5 / 13
A BRIEF HISTORY OF NEURAL NETWORKS
2006: Age of deep neural networks began.
Geoffrey Hinton showed that a deep belief network could be efficiently
trained using greedy layer-wise pretraining.
This wave of research popularized the use of the term deep learning to
emphasize that researchers were now able to train deeper neural networks
than had been possible before.
At this time, deep neural networks outperformed competing AI systems
based on other ML technologies as well as hand-designed functionality.

Deep Learning – 6 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: https://towardsdatascience.com/a-weird-introduction-to-deep-learning-7828803693b0

Deep Learning – 7 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Deep Learning – 8 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Figure: IBM Supercomputer

Watson is a question-answering system capable of answering questions posed in


natural language, developed in IBM’s DeepQA project.

In 2011, Watson competed on Jeopardy! against champions Brad Rutter and


Ken Jennings, winning the first place prize of $1 million.

Deep Learning – 9 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Figure: Google self driving car (Waymo)

Google’s development of self-driving technology began on January 17, 2009, at


the company’s secretive X lab.

By January 2020, 20 million miles of self-driving on public roads had been


completed by Waymo.

Deep Learning – 10 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: DeepMind

AlphaFold is a deep learning system, developed by Google DeepMind, for


determining a protein’s 3D shape from its amino-acid sequence.

In 2018 and 2020, AlphaFold placed first in the overall rankings of the Critical
Assessment of Techniques for Protein Structure Prediction (CASP).

Deep Learning – 11 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Credit: DeepMind

AlphaGo, originally developed by DeepMind, is a deep learning system that


plays the board game Go. In 2017, the Master version of AlphaGo beat Ke Jie,
the number one ranked player in the world at the time.

While there are several extensions to AlphaGo (e.g., Master AlphaGo, AlphaGo
Zero, AlphaZero, and MuZero), the main idea is the same: search for optimal
moves based on knowledge acquired by machine learning.

Deep Learning – 12 / 13
A BRIEF HISTORY OF NEURAL NETWORKS

Generative Pre-trained Transformer 3 (GPT-3) is the third generation of the
GPT model, introduced by OpenAI in May 2020, to produce human-like text.

There are 175 billion parameters to be learned by the algorithm, but the quality of
the generated text is so high that it is hardly possible to distinguish it from a
human-written text.

Deep Learning – 13 / 13
Deep Learning

Basic Training

Learning goals
Empirical risk minimization
Gradient descent
Stochastic gradient descent
TRAINING NEURAL NETWORKS
In ML we use empirical risk minimization (ERM) to minimize
prediction losses over the training data

R_emp = (1/n) ∑_{i=1}^{n} L(y^(i), f(x^(i) | θ))

In DL, θ represents the weights (and biases) of the NN.

Often, L2-loss in regression:

L(y, f(x)) = ½ (y − f(x))²

or cross-entropy for binary classification:

L(y, f(x)) = −(y log f(x) + (1 − y) log(1 − f(x)))

ERM can be implemented by gradient descent (GD).

Deep Learning – 1 / 11
GRADIENT DESCENT
The negative risk gradient points in the direction of the steepest descent:

−g = −∇R_emp(θ) = −(∂R_emp/∂θ1, …, ∂R_emp/∂θd)ᵀ

"Standing" at a point θ[t], we locally improve by:

θ[t+1] = θ[t] − α g,

where α is called the step size or learning rate.

Figure: Contour plot of R_emp over (θ1, θ2) with the gradient descent path.
Deep Learning – 2 / 11
GRADIENT DESCENT AND OPTIMALITY
GD is a greedy algorithm: In
every iteration, it makes
locally optimal moves.

If Remp (θ) is convex and


differentiable, and its
gradient is Lipschitz
continuous, GD is guaranteed
to converge to the global
minimum (for small enough
step-size).

However, if Remp (θ) has


multiple local optima and/or
saddle points, GD might only
converge to a stationary point
(other than the global
optimum), depending on the
starting point.

Deep Learning – 3 / 11
GRADIENT DESCENT AND OPTIMALITY
Note: It might not be that bad if we do not find the global optimum:
We don't optimize the (theoretical) risk, but only an approximate
version, i.e. the empirical risk.
For very flexible models, aggressive optimization might lead to overfitting.
Early stopping might even increase generalization performance.

Deep Learning – 4 / 11
LEARNING RATE (LR)
The step-size α plays a key role in the convergence of the algorithm.

If the step size is too small, the training process may converge very
slowly (see left image). If the step size is too large, the process may not
converge, because it jumps around the optimal point (see right image).

Deep Learning – 5 / 11
LEARNING RATE
So far we have assumed a fixed value of α in every iteration:

α[t ] = α ∀t = {1, . . . , T }
However, it makes sense to adapt α in every iteration:

Steps of gradient descent for R_emp(θ) = 10 θ1² + 0.5 θ2². Left: 100 steps with a fixed
learning rate. Right: 40 steps with an adaptive learning rate.

Deep Learning – 6 / 11
WEIGHT INITIALIZATION
Weights (and biases) of an NN must be initialized in GD.
We somehow must "break symmetry" – which would happen in
full-0-initialization. If two neurons (with the same activation) are
connected to the same inputs and have the same initial weights,
then both neurons will have the same gradient update and learn
the same features.
Weights are typically drawn from a uniform or a Gaussian distribution
(both centered at 0 with a small variance).
Two common initialization strategies are ’Glorot initialization’ and
’He initialization’ which tune the variance of these distributions
based on the topology of the network.

Deep Learning – 7 / 11
STOCHASTIC GRADIENT DESCENT (SGD)
GD for ERM was:

θ[t+1] = θ[t] − α · (1/n) ∑_{i=1}^{n} ∇_θ L(y^(i), f(x^(i) | θ[t]))

Using the entire training set in GD is called batch or
deterministic or offline training. This can be computationally
costly or impossible, if the data does not fit into memory.
Idea: Instead of letting the sum run over the whole dataset, use
small stochastic subsets (minibatches), or only a single x^(i).
If batches are uniformly sampled from Dtrain, our stochastic
gradient is in expectation the batch gradient ∇_θ R_emp(θ).
The gradient w.r.t. a single x is fast to compute but not reliable. But
it is often used in a theoretical analysis of SGD.
→ We have a stochastic, noisy version of the batch GD.

Deep Learning – 8 / 11
STOCHASTIC GRADIENT DESCENT (SGD)
SGD on function 1.25(x1 + 6)2 + (x2 − 8)2 .

Source : Shalev-Shwartz and Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University
Press, 2014.

Figure: Left = GD, right = SGD. Black line is an average of different SGD runs.

Deep Learning – 9 / 11
STOCHASTIC GRADIENT DESCENT

Algorithm: Basic SGD pseudo code

1: Initialize parameter vector θ[0]
2: t ← 0
3: while stopping criterion not met do
4:    Randomly shuffle data and partition into minibatches J1, ..., JK of size m
5:    for k ∈ {1, ..., K} do
6:       t ← t + 1
7:       Compute gradient estimate with Jk: ĝ[t] ← (1/m) ∑_{i ∈ Jk} ∇_θ L(y^(i), f(x^(i) | θ[t−1]))
8:       Apply update: θ[t] ← θ[t−1] − α ĝ[t]
9:    end for
10: end while

A full SGD pass over data is an epoch.


Minibatch sizes are typically between 50 and 1000.
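The pseudo code translates almost line by line into numpy; the sketch below uses a made-up linear regression problem with the L2 loss, so the gradient in step 7 has a closed form:

import numpy as np

rng = np.random.default_rng(0)
n, p, m, alpha = 1000, 5, 50, 0.1            # data size, features, minibatch size, LR
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(p)                           # step 1: initialize theta
for epoch in range(20):                       # "while stopping criterion not met"
    idx = rng.permutation(n)                  # step 4: shuffle ...
    for batch in idx.reshape(-1, m):          # ... and partition (assumes n divisible by m)
        Xb, yb = X[batch], y[batch]
        residual = Xb @ theta - yb            # f(x | theta) - y for the L2 loss
        g_hat = Xb.T @ residual / m           # step 7: minibatch gradient estimate
        theta = theta - alpha * g_hat         # step 8: update
print(np.round(theta - theta_true, 3))        # close to zero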

Deep Learning – 10 / 11
STOCHASTIC GRADIENT DESCENT
SGD is the most used optimizer in ML and especially in DL.
We usually have to add a considerable amount of tricks to SGD to
make it really efficient (e.g. momentum). More on this later.
SGD with (small) batches has a high variance, although it is
unbiased. Hence, the LR α is smaller than in the batch mode.
When the LR is slowly decreased, SGD converges to a local minimum.
Recent results indicate that SGD often leads to better
generalization than GD, and may result in indirect regularization.

Deep Learning – 11 / 11
Deep Learning

Chain Rule and Computational Graphs

Learning goals
Chain rule of calculus
Computational graphs
CHAIN RULE OF CALCULUS
The chain rule can be used to compute derivatives of the
composition of two or more functions.
Let x ∈ ℝᵐ, y ∈ ℝⁿ, g : ℝᵐ → ℝⁿ and f : ℝⁿ → ℝ.
If y = g(x) and z = f(y), the chain rule yields:

∂z/∂x_i = ∑_j (∂z/∂y_j) · (∂y_j/∂x_i)

or, in vector notation:

∇_x z = (∂y/∂x)ᵀ ∇_y z,

where ∂y/∂x is the (n × m) Jacobian matrix of g.

Deep Learning – 2 / ??
COMPUTATIONAL GRAPHS

CGs are nested expressions,


visualized as graphs.
Each node is a variable, either
an input or derived.
Derived variables are functions
applied to other variables.
source : Goodfellow et al. (2016)

Figure: The computational graph for


the expression H = σ(XW + B ) with
activation function σ(·).

Deep Learning – 3 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 1

Suppose we have the following


computational graph.
To compute the derivative of
∂z
∂ w we need to recursively
apply the chain rule. That is:

∂z ∂z ∂y ∂x
= · ·
∂w ∂y ∂x ∂w
= f3′ (y ) · f2′ (x ) · f1′ (w )
source : Goodfellow et al. (2016)
= f3′ (f2 (f1 (w ))) · f2′ (f1 (w )) · f1′ (w )
Figure: A computational
graph, such that
x = f1 (w ), y = f2 (x )
and z = f3 (y ).
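A small numeric sketch of this recursion, with arbitrarily chosen functions f1, f2, f3 and a finite-difference check of the resulting derivative:

import numpy as np

f1, d_f1 = np.sin, np.cos                        # x = f1(w)
f2, d_f2 = np.exp, np.exp                        # y = f2(x)
f3, d_f3 = lambda y: y ** 2, lambda y: 2 * y     # z = f3(y)

def z(w):
    return f3(f2(f1(w)))

w = 0.7
chain = d_f3(f2(f1(w))) * d_f2(f1(w)) * d_f1(w)      # f3'(f2(f1(w))) * f2'(f1(w)) * f1'(w)
eps = 1e-6
finite_diff = (z(w + eps) - z(w - eps)) / (2 * eps)  # numeric derivative for comparison
print(chain, finite_diff)                            # the two values agree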

Deep Learning – 4 / ??
CHAIN RULE OF CALCULUS: EXAMPLE 2

To compute ∇_x z, we apply the chain rule:

∂z/∂x1 = ∑_j (∂z/∂y_j)(∂y_j/∂x1) = (∂z/∂y1)(∂y1/∂x1) + (∂z/∂y2)(∂y2/∂x1)
∂z/∂x2 = ∑_j (∂z/∂y_j)(∂y_j/∂x2) = (∂z/∂y1)(∂y1/∂x2) + (∂z/∂y2)(∂y2/∂x2)

Therefore, the gradient of z w.r.t. x is

∇_x z = [∂z/∂x1; ∂z/∂x2] = [∂y1/∂x1  ∂y2/∂x1; ∂y1/∂x2  ∂y2/∂x2] [∂z/∂y1; ∂z/∂y2] = (∂y/∂x)ᵀ ∇_y z

Deep Learning – 5 / ??
COMPUTATIONAL GRAPH: NEURAL NET

Figure: A neural network can be seen as a computational graph. ϕ is the


weighted sum and σ and τ are the activations.
Note: In contrast to the top figure, the arrows in the computational graph below
merely indicate dependence, not weights.

Deep Learning – 6 / ??
Deep Learning

Basic Backpropagation 1

Learning goals
Forward and backward passes
Chain rule
Details of backprop
BACKPROPAGATION: BASIC IDEA
We would like to optimize ERM using gradient descent (GD) on:
n
1 X  (i )  (i ) 
Remp (θ) = L y ,f x | θ .
n
i =1

Backprop training of NNs runs in 2 alternating steps, for one x:


1 Forward pass (FP): Inputs flow through model to outputs. We
then compute the observation loss (see previous chapters).
2 Backward pass (BP): Loss flows backwards to update weights so
error is reduced, as in GD.

We will see: This is simply (S)GD in disguise, cleverly using the chain
rule, so we can reuse a lot of intermediate results.

Bernd Bischl Deep Learning – 1 / 19


XOR EXAMPLE
As activations (hidden and outputs) we use the sigmoid function.
We run one FP and BP on x = (1, 0)T with y = 1.
We use L2 loss between 0-1 labels and the predicted probabilities.
This is a bit uncommon, but computations become simpler.

Note: We will only show rounded decimals.

Bernd Bischl Deep Learning – 2 / 19


FORWARD PASS
We will divide the FP into four steps:
the inputs of zi : zi ,in
the activations of zi : zi ,out
the input of f : fin
and finally the activation of f : fout

Bernd Bischl Deep Learning – 3 / 19


FORWARD PASS

z1,in = W1ᵀ x + b1 = 1 · (−0.07) + 0 · 0.22 + 1 · (−0.46) = −0.53
z1,out = σ(z1,in) = 1 / (1 + exp(−(−0.53))) = 0.3705
Bernd Bischl Deep Learning – 4 / 19


FORWARD PASS

z2,in = W2ᵀ x + b2 = 1 · 0.94 + 0 · 0.46 + 1 · 0.1 = 1.04
z2,out = σ(z2,in) = 1 / (1 + exp(−1.04)) = 0.7389

Bernd Bischl Deep Learning – 5 / 19


FORWARD PASS

fin = uᵀ z + c = 0.3705 · (−0.22) + 0.7389 · 0.58 + 1 · 0.78 = 1.1122
fout = τ(fin) = 1 / (1 + exp(−1.1122)) = 0.7525

Bernd Bischl Deep Learning – 6 / 19


FORWARD PASS
The FP predicted fout = 0.7525.
Now, we compare the prediction fout = 0.7525 and the true label
y = 1 using the L2-loss:

L(y, f(x)) = ½ (y − f(x^(i) | θ))² = ½ (y − fout)²
           = ½ (1 − 0.7525)² = 0.0306
The calculation of the gradient is performed backwards (starting
from the output layer), so that results can be reused.
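The forward pass can be reproduced with a few lines of numpy, using the weights shown in the figure; since the slides round intermediate results, the printed values may differ slightly in the last digits:

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x, y = np.array([1.0, 0.0]), 1.0
W = np.array([[-0.07, 0.94],
              [ 0.22, 0.46]])      # columns correspond to the hidden neurons z1, z2
b = np.array([-0.46, 0.10])
u = np.array([-0.22, 0.58])
c = 0.78

z_in = W.T @ x + b                 # (-0.53, 1.04)
z_out = sigmoid(z_in)              # (0.3705, 0.7389)
f_in = u @ z_out + c
f_out = sigmoid(f_in)              # approx. 0.75
loss = 0.5 * (y - f_out) ** 2      # approx. 0.03
print(z_out, f_out, loss)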

Bernd Bischl Deep Learning – 7 / 19


BACKWARD PASS
The main ingredients of the backward pass are:
to reuse the results of the forward pass
(here: zi ,in , zi ,out , fin , fout )
reuse the intermediate results from the chain rule
the derivative of the activations and some affine functions

Bernd Bischl Deep Learning – 8 / 19


BACKWARD PASS
Let’s start to update u1 . We recursively apply the chain rule:

∂L(y, f(x))/∂u1 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂u1)

Figure: Snippet from our NN, with backward path for u1 .

Bernd Bischl Deep Learning – 9 / 19


BACKWARD PASS
1st step: The derivative of L2 loss is easy; we know fout from FP.

∂L(y, f(x))/∂fout = ∂/∂fout [ ½ (y − fout)² ] = −(y − fout)    (where y − fout is the residual)
                  = −(1 − 0.7525) = −0.2475

Bernd Bischl Deep Learning – 10 / 19


BACKWARD PASS
2nd step. fout = σ(fin ), use rule for σ ′ , use fin from FP.

∂fout/∂fin = σ(fin) · (1 − σ(fin)) = 0.7525 · (1 − 0.7525) = 0.1862

Bernd Bischl Deep Learning – 11 / 19


BACKWARD PASS
3rd step. Derivative of the linear input is easy; use z1,out from FP.

∂fin/∂u1 = ∂(u1 · z1,out + u2 · z2,out + c · 1)/∂u1 = z1,out = 0.3705

Bernd Bischl Deep Learning – 12 / 19


BACKWARD PASS
Plug it together:

∂L(y, f(x))/∂u1 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂u1)
                = −0.2475 · 0.1862 · 0.3705 = −0.0171

With LR α = 0.5:

u1[new] = u1[old] − α · ∂L(y, f(x))/∂u1
        = −0.22 − 0.5 · (−0.0171) = −0.2115
Bernd Bischl Deep Learning – 13 / 19


BACKWARD PASS
Now for W11 :

∂L(y, f(x))/∂W11 = (∂L(y, f(x))/∂fout) · (∂fout/∂fin) · (∂fin/∂z1,out) · (∂z1,out/∂z1,in) · (∂z1,in/∂W11)

We know ∂L(y, f(x))/∂fout and ∂fout/∂fin from the BP for u1.

Bernd Bischl Deep Learning – 14 / 19


BACKWARD PASS
fin = u1 · z1,out + u2 · z2,out + c · 1 is linear, easy and we know u1 :

∂fin/∂z1,out = u1 = −0.22

Bernd Bischl Deep Learning – 15 / 19


BACKWARD PASS
Next. Use rule for σ ′ and FP results:

∂z1,out/∂z1,in = σ(z1,in) · (1 − σ(z1,in)) = 0.3705 · (1 − 0.3705) = 0.2332

Bernd Bischl Deep Learning – 16 / 19


BACKWARD PASS
z1,in = x1 · W11 + x2 · W21 + b1 · 1 is linear and depends on inputs:

∂z1,in/∂W11 = x1 = 1

Bernd Bischl Deep Learning – 17 / 19


BACKWARD PASS
Plugging together:
∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11
                 = (−0.2475) · 0.1862 · (−0.22) · 0.2332 · 1 = 0.0024

Full SGD update:

W11^[new] = W11^[old] − α · ∂L(y, f(x))/∂W11
          = −0.07 − 0.5 · 0.0024 = −0.0712

Bernd Bischl Deep Learning – 18 / 19


RESULT
We can do this for all weights:
   
W = ( −0.0712   0.9426 )        b = ( −0.4612 )
    (  0.22     0.46   )            (  0.1026 )

u = ( −0.2115 )        and  c = 0.8030.
    (  0.5970 )

Yields f(x | θ^[new]) = 0.7615 and loss ½ (1 − 0.7615)² = 0.0284.


Before, we had f (x | θ [old ] ) = 0.7525 and higher loss 0.0306.

Now rinse and repeat. This was one training iter, we do thousands.
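For completeness, a compact NumPy sketch of one full training iteration (forward pass, backward pass via the chain rule, SGD update with α = 0.5); it only illustrates the computations above, with variable names chosen by us:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# weights of the XOR example, one observation and the learning rate
W = np.array([[-0.07, 0.94], [0.22, 0.46]]); b = np.array([-0.46, 0.10])
u = np.array([-0.22, 0.58]); c = 0.78
x = np.array([1.0, 0.0]); y = 1.0; alpha = 0.5

# forward pass (all intermediate results are cached for the backward pass)
z_in = W.T @ x + b
z_out = sigmoid(z_in)
f_in = u @ z_out + c
f_out = sigmoid(f_in)

# backward pass: error signal of the output unit, then of the hidden units
delta_out = -(y - f_out) * f_out * (1 - f_out)      # dL/df_in
delta_hid = u * delta_out * z_out * (1 - z_out)     # dL/dz_in, one entry per hidden unit

# gradients w.r.t. all parameters
grad_u, grad_c = delta_out * z_out, delta_out
grad_W, grad_b = np.outer(x, delta_hid), delta_hid

# SGD update
u, c = u - alpha * grad_u, c - alpha * grad_c
W, b = W - alpha * grad_W, b - alpha * grad_b
print(np.round(W, 4), np.round(b, 4), np.round(u, 4), round(c, 4))  # close to the values above
```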

Bernd Bischl Deep Learning – 19 / 19


Deep Learning

Basic Backpropagation 2

Learning goals
Backprop formalism and
recursion
BACKWARD COMPUTATION AND CACHING
In the XOR example, we computed:

∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11

Deep Learning – 1 / 9
BACKWARD COMPUTATION AND CACHING
Next, let us compute:

∂L(y, f(x))/∂W21 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W21

Deep Learning – 2 / 9
BACKWARD COMPUTATION AND CACHING
Examining the two expressions:
∂L(y, f(x))/∂W11 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W11

∂L(y, f(x))/∂W21 = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in · ∂z1,in/∂W21
Significant overlap / redundancy in the two expressions: the first four factors are identical.
Again: Let’s call this common subexpression δ1 and cache it.

δ1 = ∂L(y, f(x))/∂z1,in = ∂L(y, f(x))/∂fout · ∂fout/∂fin · ∂fin/∂z1,out · ∂z1,out/∂z1,in

∂L(y, f(x))/∂W11 = δ1 · ∂z1,in/∂W11    and    ∂L(y, f(x))/∂W21 = δ1 · ∂z1,in/∂W21
δ1 can also be seen as an error signal that represents how much
the loss L changes when the input z1,in changes.
Deep Learning – 3 / 9
BACKPROP: RECURSION
Let us now derive a general formulation of backprop.
The neurons in layers i − 1, i and i + 1 are indexed by j, k and m,
respectively.
The output layer will be referred to as layer O.

Credit: Erik Hallström

Deep Learning – 4 / 9
BACKPROP: RECURSION

Let δ^(i)_k̃ (also: error signal) for a neuron k̃ in layer i represent how much
the loss L changes when the input z^(i)_k̃,in changes:

δ^(i)_k̃ = ∂L / ∂z^(i)_k̃,in
        = ∂L/∂z^(i)_k̃,out · ∂z^(i)_k̃,out/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂z^(i+1)_m,in/∂z^(i)_k̃,out ) · ∂z^(i)_k̃,out/∂z^(i)_k̃,in

Note: The sum in the expression above is over all the neurons in layer
i + 1. This is simply an application of the (multivariate) chain rule.

Deep Learning – 5 / 9
BACKPROP: RECURSION
Using

z^(i)_k̃,out = σ(z^(i)_k̃,in)
z^(i+1)_m,in = Σ_k W^(i+1)_k,m z^(i)_k,out + b^(i+1)_m

we get:

δ^(i)_k̃ = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂z^(i+1)_m,in/∂z^(i)_k̃,out ) · ∂z^(i)_k̃,out/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · ∂( Σ_k W^(i+1)_k,m z^(i)_k,out + b^(i+1)_m )/∂z^(i)_k̃,out ) · ∂σ(z^(i)_k̃,in)/∂z^(i)_k̃,in
        = ( Σ_m ∂L/∂z^(i+1)_m,in · W^(i+1)_k̃,m ) · σ′(z^(i)_k̃,in)
        = ( Σ_m δ^(i+1)_m W^(i+1)_k̃,m ) · σ′(z^(i)_k̃,in)

Therefore, we now have a recursive definition for the error signal of a
neuron in layer i in terms of the error signals of the neurons in layer
i + 1 and, by extension, layers {i+2, i+3, . . . , O}!

Deep Learning – 6 / 9
BACKPROP: RECURSION
Given the error signal δ^(i)_k̃ of neuron k̃ in layer i, the derivative of the
loss L w.r.t. the weight W^(i)_j̃,k̃ is simply:

∂L/∂W^(i)_j̃,k̃ = ∂L/∂z^(i)_k̃,in · ∂z^(i)_k̃,in/∂W^(i)_j̃,k̃ = δ^(i)_k̃ · z^(i−1)_j̃,out

because z^(i)_k̃,in = Σ_j W^(i)_j,k̃ z^(i−1)_j,out + b^(i)_k̃.

Similarly, the derivative of the loss L w.r.t. the bias b^(i)_k̃ is:

∂L/∂b^(i)_k̃ = ∂L/∂z^(i)_k̃,in · ∂z^(i)_k̃,in/∂b^(i)_k̃ = δ^(i)_k̃

Deep Learning – 7 / 9
BACKPROP: RECURSION
It is not hard to show that the error signal δ^(i) for an entire layer i is
(⊙ = element-wise product):

δ^(O) = ∇_fout L ⊙ τ′(fin)
δ^(i) = (W^(i+1) δ^(i+1)) ⊙ σ′(z^(i)_in)
Therefore, backpropagation works by computing and storing the
error signals backwards. That is, starting at the output layer and
ending at the first hidden layer. This way, the error signals of later
layers propagate backwards to the earlier layers.
The derivative of the loss L w.r.t. a given weight is computed
efficiently by plugging in the cached error signals, thereby avoiding
expensive and redundant computations.
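A minimal NumPy sketch of this recursion for a fully connected network with sigmoid activations and L2 loss; the layer sizes, variable names and random weights are our own illustration choices:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, Ws, bs):
    """One forward/backward pass; Ws[i] maps layer i to layer i+1 (columns = units of layer i+1)."""
    # forward pass: cache inputs and activations of every layer
    z_outs, z_ins = [x], []
    for W, b in zip(Ws, bs):
        z_ins.append(W.T @ z_outs[-1] + b)
        z_outs.append(sigmoid(z_ins[-1]))
    f_out = z_outs[-1]

    # output-layer error signal for L2 loss with sigmoid output activation
    delta = -(y - f_out) * f_out * (1 - f_out)

    # backward pass: recurse the error signal through the layers, collect gradients
    grads_W, grads_b = [], []
    for i in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(z_outs[i], delta))   # dL/dW^(i) = z_out^(i-1) * delta^(i)
        grads_b.insert(0, delta)                        # dL/db^(i) = delta^(i)
        if i > 0:
            delta = (Ws[i] @ delta) * z_outs[i] * (1 - z_outs[i])  # error signal of layer i
    return grads_W, grads_b

# tiny example: a 2-3-1 network with random weights
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
bs = [np.zeros(3), np.zeros(1)]
gW, gb = backprop(np.array([1.0, 0.0]), np.array([1.0]), Ws, bs)
```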

Deep Learning – 8 / 9
Deep Learning

Hardware and Software

Learning goals
GPU training for accelerated
learning
Software for hardware support
Deep learning software platforms
Hardware for Deep Learning

Deep Learning – 1 / 9
HARDWARE FOR DEEP LEARNING
Deep NNs require special hardware to be trained efficiently.
The training is done using Graphics Processing Units (GPUs) and
a special programming language called CUDA.
For most NNs, training on standard CPUs takes a very long time.

Figure: Left: Each CPU can do 2-8 parallel computations. Right: A single
GPU can do thousands of simple parallel computations.

Deep Learning – 2 / 9
GRAPHICS PROCESSING UNITS (GPUS)

Initially developed to accelerate the creation of graphics


Massively parallel: identical and independent computations for
every pixel
Computer Graphics makes heavy use of linear algebra (just like
neural networks)
Less flexible than CPUs: all threads in a core concurrently execute
the same instruction on different data.
Very fast for CNNs, RNNs need more time
Popular ones: GTX 1080 Ti, RTX 3080 / 2080 Ti, Titan RTX, Tesla
V100 / A100
Hundreds of threads per core, few thousand cores, around 10
teraFLOPS in single precision, some 10s GBs of memory
Memory is important: some SOTA architectures do not fit on GPUs
with <10 GB of memory

Deep Learning – 3 / 9
TENSOR PROCESSING UNITS (TPUS)

Specialized and proprietary chip for deep learning developed by


Google
Hundreds of teraFLOPS per chip
Can be connected together in pods of thousands TPUs each
(result: hundreds of petaFLOPS per pod)
Not a consumer product – can be used in the Google Cloud
Platform (from >1.35 USD / TPU / hour) or Google Colab (free!)
Enabled impressive progress: DeepMind’s AlphaZero for Chess
became world champion after just 4h of training on 5064 TPUs

Deep Learning – 4 / 9
AND EVERYTHING ELSE...
With such powerful devices, memory/disk access during training
becomes the bottleneck
Nvidia DGX-1: Specialized solution with eight Tesla V100
GPUs, dual Intel Xeon, 512 GB of RAM, 4 SSD disks of 2TB
each
Specialized hardware for on-device inference
Example: Neural Engine on the Apple A11 (used for FaceID)
Keywords/buzzwords: Edge computing and Federated
learning

Deep Learning – 5 / 9
Software for Deep Learning

Deep Learning – 6 / 9
SOFTWARE FOR DEEP LEARNING
CUDA is a very low level programming language and thus writing
code for deep learning requires a lot of work.
Deep learning (software) frameworks:
Abstract the hardware (same code for CPU/GPU/TPU)
Automatically differentiate all computations
Distribute training among several hosts
Provide facilities for visualizing and debugging models
Can be used from several programming languages
Based on the concept of computational graph

Deep Learning – 7 / 9
SOFTWARE FOR DEEP LEARNING
Tensorflow
Popular in the industry
Developed by Google and
open source community
Python, R, C++ and Javascript APIs
Distributed training on GPUs and TPUs
Tools for visualizing neural nets, running
them efficiently on phones and embedded devices.
Keras
Intuitive, high-level wrapper
of Tensorflow for rapid prototyping
Python and (unofficial) R APIs

Deep Learning – 8 / 9
SOFTWARE FOR DEEP LEARNING
Pytorch
(Most) Popular in academia
Supported by Facebook
Python and C++ APIs
Distributed training on GPUs

MXNet
Open-source deep learning framework
written in C++ and cuda (used by
Amazon for their Amazon Web Services)
Scalable, allowing fast model training
Supports flexible model programming and
multiple languages (C++, Python, Julia,
Matlab, JavaScript, Go, R, Scala, Perl)

Deep Learning – 9 / 9
Deep Learning

Basic Regularization

Learning goals
Regularized cost functions
Norm penalties
Weight decay
Equivalence with constrained
optimization
REGULARIZATION
Any technique that is designed to reduce the test error possibly at
the expense of increased training error can be considered a form
of regularization.
Regularization is important in DL because NNs can have
extremely high capacity (millions of parameters) and are thus
prone to overfitting.

Deep Learning – 1 / 9
REVISION: REGULARIZED RISK MINIMIZATION
The goal of regularized risk minimization is to penalize the
complexity of the model to minimize the chances of overfitting.
By adding a parameter norm penalty term J (θ) to the empirical
risk Remp (θ) we obtain a regularized cost function:

Rreg (θ) = Remp (θ) + λJ (θ)

with hyperparameter λ ∈ [0, ∞) that weights the penalty term
relative to the unconstrained objective function Remp(θ).
Therefore, instead of pure empirical risk minimization, we add a
penalty for complex (read: large) parameters θ .
Declaring λ = 0 obviously results in no penalization.
We can choose between different parameter norm penalties J (θ).
In general, we do not penalize the bias.

Deep Learning – 2 / 9
L2-REGULARIZATION / WEIGHT DECAY
Let us optimize the L2-regularized risk of a model f (x | θ)

min_θ Rreg(θ) = min_θ [ Remp(θ) + (λ/2) ‖θ‖₂² ]
by gradient descent. The gradient is

∇Rreg (θ) = ∇Remp (θ) + λθ.

We iteratively update θ by step size α times the negative gradient


 
θ^[new] = θ^[old] − α (∇Remp(θ) + λθ^[old]) = θ^[old](1 − αλ) − α ∇Remp(θ)

→ The term λθ^[old] causes the parameter (weight) to decay in proportion to its size (which gives rise to the name).
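A minimal NumPy sketch of this update rule on a made-up quadratic Remp (all constants are illustration values):

```python
import numpy as np

# made-up example: Remp(theta) = 0.5 * ||A theta - y||^2 for some fixed A, y
A = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
y = np.array([1.0, 0.0, 2.0])
grad_emp = lambda theta: A.T @ (A @ theta - y)   # gradient of the unregularized risk

alpha, lam = 0.05, 0.1                           # step size and penalty weight
theta = np.zeros(2)
for _ in range(200):
    # L2-regularized GD step == weight decay: shrink theta, then take a plain GD step
    theta = theta * (1 - alpha * lam) - alpha * grad_emp(theta)
print(theta)
```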

Deep Learning – 3 / 9
EQUIVALENCE TO CONSTRAINED OPTIMIZATION
Norm penalties can be interpreted as imposing a constraint on the
weights. One can show that

arg min_θ  Remp(θ) + λ J(θ)

is equivalent to

arg min_θ  Remp(θ)
subject to J(θ) ≤ k

for some value k that depends on λ and the nature of Remp(θ).


(Goodfellow et al. (2016), ch. 7.2)

Deep Learning – 4 / 9
EXAMPLE: WEIGHT DECAY

We fit the huge neural network on the right side on a smaller fraction of
MNIST (5000 train and 1000 test observations).
Weight decay: λ ∈ (10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 0)

Deep Learning – 5 / 9
EXAMPLE: WEIGHT DECAY
Figure: Train error vs. epochs (0 to 200) for weight decay λ ∈ {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²}.

A high weight decay of 10−2 leads to a high error on the training data.

Deep Learning – 6 / 9
EXAMPLE: WEIGHT DECAY
Figure: Test error vs. epochs (0 to 200) for weight decay λ ∈ {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²}.

Second strongest weight decay leads to the best result on the test data.

Deep Learning – 7 / 9
TENSORFLOW PLAYGROUND

https://playground.tensorflow.org/

Deep Learning – 8 / 9
TENSORFLOW PLAYGROUND - EXERCISE

https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/playground-exercise-examining-l2-regularization

Deep Learning – 9 / 9
Introduction to Machine Learning

Regularization in Non-Linear Models and


Bayesian Priors

Learning goals
Understand that regularization
and parameter shrinkage can be
applied to non-linear models
Know structural risk minimization
Know how regularized risk
minimization is the same as MAP
in a Bayesian perspective, where
the penalty corresponds to a
parameter prior.
SUMMARY: REGULARIZED RISK MINIMIZATION
If we should define ML in only one line, this might be it:
min_θ Rreg(θ) = min_θ ( Σ_{i=1}^n L(y^(i), f(x^(i) | θ)) + λ · J(θ) )

We can choose for a task at hand:


the hypothesis space of f , which determines how features can
influence the predicted y
the loss function L, which measures how errors should be treated
the regularization J (θ), which encodes our inductive bias and
preference for certain simpler models

By varying these choices one can construct a huge number of different


ML models. Many ML models follow this construction principle or can
be interpreted through the lens of regularized risk minimization.

c Introduction to Machine Learning – 1 / 13


REGULARIZATION IN NONLINEAR MODELS

So far we have mainly considered regularization in LMs.


Can also be applied to non-linear models (with numeric
parameters), where it is often important to prevent overfitting.
Here, we typically use L2 regularization, which still results in
parameter shrinkage and weight decay.
By adding regularization, prediction surfaces in regression and
classification become smoother.
Note: In the chapter on non-linear SVMs we will study the effects
of regularization on a non-linear model in detail.

c Introduction to Machine Learning – 2 / 13


REGULARIZATION IN NONLINEAR MODELS
Setting: Classification for the spirals data. Neural network with single
hidden layer containing 10 neurons and logistic output activation, regularized
with L2 penalty term for λ > 0. Varying λ affects smoothness of the decision
boundary and magnitude of network weights:

c Introduction to Machine Learning – 3 / 13




REGULARIZATION IN NONLINEAR MODELS
The prevention of overfitting can also be seen in CV. Same settings as
before, but each λ is evaluated with repeated CV (10 folds, 5 reps).

We see the typical U-shape with the sweet spot between overfitting
(LHS, low λ) and underfitting (RHS, high λ) in the middle.

c Introduction to Machine Learning – 4 / 13


STRUCTURAL RISK MINIMIZATION

Thus far, we only considered adding a complexity penalty to


empirical risk minimization.
Instead, structural risk minimization (SRM) assumes that the
hypothesis space H can be decomposed into increasingly
complex hypotheses (size or capacity): H = ∪k ≥1 Hk .
Complexity parameters can be, e.g., the degree of polynomials
in linear models or the size of hidden layers in neural networks.

c Introduction to Machine Learning – 5 / 13


STRUCTURAL RISK MINIMIZATION
SRM chooses the smallest k such that the optimal model from Hk
found by ERM or RRM cannot significantly be outperformed by a
model from a Hm with m > k .
By this, the simplest model can be chosen, which minimizes the
generalization bound.
One challenge might be choosing an adequate complexity
measure, as for some models, multiple complexity measures exist.

Figure: Generalization error and training error as functions of model complexity.
c Introduction to Machine Learning – 6 / 13


STRUCTURAL RISK MINIMIZATION
Setting: Classification for the spirals data. Neural network with single
hidden layer containing k neurons and logistic output activation, L2 regularized
with λ = 0.001. So here SRM and RRM are both used. Varying the size of the
hidden layer affects smoothness of the decision boundary:

c Introduction to Machine Learning – 7 / 13




STRUCTURAL RISK MINIMIZATION
Again, complexity vs CV score.

A minimal model with good generalization seems to have ca. 6-8


hidden neurons.

c Introduction to Machine Learning – 8 / 13


STRUCTURAL RISK MINIMIZATION AND RRM
Note that normal RRM can also be interpreted through SRM, if we
rewrite the penalized ERM as constrained ERM.

min_θ Σ_{i=1}^n L(y^(i), f(x^(i) | θ))
s.t. ‖θ‖₂² ≤ t

We can interpret going through λ from large to small as going through t from
small to large. This constructs a series of ERM problems with
hypothesis spaces Hλ, where we constrain the norm of θ to balls
of growing size.

c Introduction to Machine Learning – 9 / 13


RRM VS. BAYES
We already created a link between max. likelihood estimation and ERM.

Now we will generalize this for RRM.

Assume we have a parameterized distribution p(y |θ, x) for our data and
a prior q (θ) over our parameter space, all in the Bayesian framework.

From the Bayes theorem we know:

p(θ | x, y) = p(y | θ, x) q(θ) / p(y | x) ∝ p(y | θ, x) q(θ)

c Introduction to Machine Learning – 10 / 13


RRM VS. BAYES
The maximum a posteriori (MAP) estimator of θ is now the minimizer of

− log p (y | θ, x) − log q (θ).

Again, we identify the loss L (y , f (x | θ)) with − log(p(y |θ, x)).


If q (θ) is constant (i.e., we used a uniform, non-informative prior),
the second term is irrelevant and we arrive at ERM.
If not, we can identify J (θ) ∝ − log(q (θ)), i.e., the log-prior
corresponds to the regularizer, and the additional λ, which controls
the strength of our penalty, usually influences the peakedness /
inverse variance / strength of our prior.

c Introduction to Machine Learning – 11 / 13


RRM VS. BAYES

L2 regularization corresponds to a zero-mean Gaussian prior with
constant variance on our parameters: θj ∼ N(0, τ²).
L1 corresponds to a zero-mean Laplace prior: θj ∼ Laplace(0, b).
Laplace(µ, b) has density (1/(2b)) exp(−|µ − x| / b), with scale parameter b,
mean µ and variance 2b².
In both cases, regularization strength increases as the variance of
the prior decreases: a prior probability mass more narrowly
concentrated around 0 encourages shrinkage.

c Introduction to Machine Learning – 12 / 13


EXAMPLE: BAYESIAN L2 REGULARIZATION
We can easily see the equivalence of L2 regularization and a Gaussian prior:

We define a Gaussian prior with uncorrelated components for θ :


q(θ) = N_d(0, diag(τ²)) = ∏_{j=1}^d N(0, τ²) = (2πτ²)^(−d/2) exp( −(1/(2τ²)) Σ_{j=1}^d θj² ).

With this, the MAP estimator becomes

θ̂_MAP = arg min_θ ( −log p(y | θ, x) − log q(θ) )
       = arg min_θ ( −log p(y | θ, x) + (d/2) log(2πτ²) + (1/(2τ²)) Σ_{j=1}^d θj² )
       = arg min_θ ( −log p(y | θ, x) + (1/(2τ²)) ‖θ‖₂² ).

We see how the inverse variance (precision) 1/τ² controls shrinkage.

c Introduction to Machine Learning – 13 / 13


Introduction to Machine Learning

Geometric Analysis of L2 Regularization


and Weight Decay

Learning goals
Have a geometric understanding
of L2 regularization
Understand why L2
regularization in combination
with gradient descent is called
weight decay
WEIGHT DECAY VS. L2 REGULARIZATION
Let us optimize the L2-regularized risk of a model f (x | θ)

min_θ Rreg(θ) = min_θ [ Remp(θ) + (λ/2) ∥θ∥₂² ]
by gradient descent. The gradient is

∇θ Rreg (θ) = ∇θ Remp (θ) + λθ.


We iteratively update θ by step size α times the negative gradient

 
θ^[new] = θ^[old] − α (∇θ Remp(θ^[old]) + λθ^[old])
        = θ^[old](1 − αλ) − α ∇θ Remp(θ^[old]).

The term λθ^[old] causes the parameter (weight) to decay in proportion
to its size. This is a very well-known technique in deep learning - and
simply L2 regularization in disguise.

© Introduction to Machine Learning – 1 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
When we use weight decay, we follow the steepest slope of Remp as for
gradient descent, but in every step, we are pulled back to the origin.

© Introduction to Machine Learning – 2 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
How strongly we are pulled back to the origin for a fixed stepsize α
depends only on λ (as long as the procedure converges):

© Introduction to Machine Learning – 3 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Weight decay can be interpreted geometrically.

Let’s use a quadratic Taylor approximation of the unregularized


objective Remp (θ) in the neighborhood of its minimizer θ̂ ,

R̃emp(θ) = Remp(θ̂) + ∇θ Remp(θ̂) · (θ − θ̂) + ½ (θ − θ̂)ᵀ H (θ − θ̂),

where H is the Hessian matrix of Remp (θ) evaluated at θ̂ .

The first-order term is 0 in the expression above because the


gradient is 0 at the minimizer.
H is positive semidefinite.

© Introduction to Machine Learning – 4 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
The minimum of R̃emp (θ) occurs where ∇θ R̃emp (θ) = H (θ − θ̂) is 0.

Now we L2-regularize R̃emp (θ), such that

R̃reg(θ) = R̃emp(θ) + (λ/2) ∥θ∥₂²

and solve this approximation of Rreg for the minimizer θ̂Ridge :

∇θ R̃reg (θ) = 0,
λθ + H (θ − θ̂) = 0,
(H + λI )θ = H θ̂,
θ̂Ridge = (H + λI )−1 H θ̂,

This gives us a formula to see how the minimizer of the L2-regularized
version is a transformation of the minimizer of the unpenalized version.

© Introduction to Machine Learning – 5 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
As λ approaches 0, the regularized solution θ̂Ridge approaches θ̂ .
What happens as λ grows?
Because H is a real symmetric matrix, it can be decomposed as
H = Q ΣQ ⊤ , where Σ is a diagonal matrix of eigenvalues and Q
is an orthonormal basis of eigenvectors.
Rewriting the transformation formula with this:

θ̂_Ridge = (Q Σ Q⊤ + λI)⁻¹ Q Σ Q⊤ θ̂
        = [ Q (Σ + λI) Q⊤ ]⁻¹ Q Σ Q⊤ θ̂
        = Q (Σ + λI)⁻¹ Σ Q⊤ θ̂

Therefore, weight decay rescales θ̂ along the axes defined by the
eigenvectors of H. The component of θ̂ that is aligned with the j-th
eigenvector of H is rescaled by a factor of σj / (σj + λ), where σj is the
corresponding eigenvalue.
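A quick numerical check of this rescaling identity on a made-up positive definite H (all values are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
H = B @ B.T + np.eye(3)          # a positive definite "Hessian"
theta_hat = rng.normal(size=3)   # unregularized minimizer
lam = 0.5

# direct formula: (H + lambda I)^(-1) H theta_hat
direct = np.linalg.solve(H + lam * np.eye(3), H @ theta_hat)

# eigendecomposition view: rotate, rescale each component by sigma_j / (sigma_j + lambda), rotate back
sigma, Q = np.linalg.eigh(H)
rescaled = Q @ ((sigma / (sigma + lam)) * (Q.T @ theta_hat))

print(np.allclose(direct, rescaled))   # True
```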
© Introduction to Machine Learning – 6 / 12
WEIGHT DECAY VS. L2 REGULARIZATION
Firstly, θ̂ is rotated by Q ⊤ , which we can interpret as a projection of θ̂
on the rotated coordinate system defined by the principal directions of
H:

© Introduction to Machine Learning – 7 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Since, for λ = 0, the transformation matrix (Σ + λI )−1 Σ = Σ−1 Σ = I,
we simply arrive at θ̂ again after projecting back.

© Introduction to Machine Learning – 8 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
If λ > 0, the component projected on the j-th axis gets rescaled by
σj / (σj + λ) before θ̂_Ridge is rotated back.

© Introduction to Machine Learning – 9 / 12


WEIGHT DECAY VS. L2 REGULARIZATION
Along directions where the eigenvalues of H are relatively large, for
example, where σj >> λ, the effect of regularization is quite small.
On the other hand, components with σj << λ will be shrunk to
have nearly zero magnitude.
In other words, only directions along which the parameters
contribute significantly to reducing the objective function are
preserved relatively intact.
In the other directions, a small eigenvalue of the Hessian means
that moving in this direction will not significantly increase the
gradient. For such unimportant directions, the corresponding
components of θ are decayed away.

© Introduction to Machine Learning – 10 / 12


WEIGHT DECAY VS. L2 REGULARIZATION

Credit: Goodfellow et al. (2016), ch. 7

Figure: The solid ellipses represent the contours of the unregularized objective and
the dashed circles represent the contours of the L2 penalty. At θ̂Ridge , the competing
objectives reach an equilibrium.

In the first dimension, the eigenvalue of the Hessian of Remp (θ) is small. The
objective function does not increase much when moving horizontally away
from θ̂ . Therefore, the regularizer has a strong effect on this axis and θ1 is
pulled close to zero.

© Introduction to Machine Learning – 11 / 12


WEIGHT DECAY VS. L2 REGULARIZATION

Credit: Goodfellow et al. (2016), ch. 7

Figure: The solid ellipses represent the contours of the unregularized objective and
the dashed circles represent the contours of the L2 penalty. At θ̂Ridge , the competing
objectives reach an equilibrium.

In the second dimension, the corresponding eigenvalue is large indicating high


curvature. The objective function is very sensitive to movement along this axis
and, as a result, the position of θ2 is less affected by the regularization.

© Introduction to Machine Learning – 12 / 12


Introduction to Machine Learning

Early Stopping

Learning goals
Know how early stopping works
Understand how early stopping
acts as a regularizer
EARLY STOPPING
When training with an iterative optimizer such as SGD, it is
commonly the case that, after a certain number of iterations,
generalization error begins to increase even though training error
continues to decrease.
Early stopping refers to stopping the algorithm early before the
generalization error increases.

Figure: After a certain number of iterations, the algorithm begins to overfit.

© Introduction to Machine Learning – 1 / 4


EARLY STOPPING
How early stopping works:
1 Split training data Dtrain into Dsubtrain and Dval (e.g. with a ratio of
2:1).
2 Train on Dsubtrain and evaluate model using the validation set Dval .
3 Stop training when validation error stops decreasing (after a range
of “patience” steps).
4 Use parameters of the previous step for the actual model.
More sophisticated forms also apply cross-validation.
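A minimal sketch of such a training loop with a patience counter; train_epoch and validation_error are placeholders for whatever training and evaluation routines are used (they are not part of the lecture material):

```python
import copy

def fit_with_early_stopping(model, train_epoch, validation_error, patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                    # one pass over D_subtrain
        err = validation_error(model)         # evaluate on D_val
        if err < best_err:
            best_err, best_state = err, copy.deepcopy(model)  # temporary copy of the parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_state, best_err
```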

© Introduction to Machine Learning – 2 / 4


EARLY STOPPING
Strengths                              | Weaknesses
Effective and simple                   | Periodical evaluation of validation error
Applicable to almost any model         | Temporary copy of θ (we have to save the whole
  without adjustment                   |   model each time validation error improves)
Combinable with other regularization   | Less data for training → include Dval afterwards
  methods                              |

Relation between optimal early-stopping iteration Tstop and
weight-decay penalization parameter λ for step-size α (see
Goodfellow et al. (2016), pages 251-252, for a proof):

Tstop ≈ 1 / (αλ)   ⇔   λ ≈ 1 / (Tstop · α)

Small λ (low penalization) ⇒ high Tstop (complex model / lots of updates).

© Introduction to Machine Learning – 3 / 4


EARLY STOPPING

Credit: Goodfellow et al. (2016)

Figure: An illustration of the effect of early stopping. Left: The solid contour lines
indicate the contours of the negative log-likelihood. The dashed line indicates the
trajectory taken by SGD beginning from the origin. Rather than stopping at the point θ̂
that minimizes the risk, early stopping results in the trajectory stopping at an earlier
point θ̂Ridge . Right: An illustration of the effect of L2 regularization for comparison. The
dashed circles indicate the contours of the L2 penalty which causes the minimum of the
total cost to lie closer to the origin than the minimum of the unregularized cost.

© Introduction to Machine Learning – 4 / 4


Deep Learning

Dropout and Augmentation

Learning goals
Recap: Ensemble Methods
Dropout
Augmentation
Ensemble Methods

Deep Learning – 1 / 22
RECAP: ENSEMBLE METHODS
Idea: Train several models separately, and average their
prediction (i.e. perform model averaging).
Intuition: This improves performance on test set, since different
models will not make the same errors.
Ensembles can be constructed in different ways, e.g.:
by combining completely different kind of models (using
different learning algorithms and loss functions).
by bagging: train the same model on k datasets, constructed
by sampling n samples from original dataset.
Since training a neural network repeatedly on the same dataset
results in different solutions (why?) it can even make sense to
combine those.

Deep Learning – 2 / 22
RECAP: ENSEMBLE METHODS

Figure: A cartoon description of bagging (Goodfellow et al. (2016))

Deep Learning – 3 / 22
Dropout

Deep Learning – 4 / 22
DROPOUT
Idea: reduce overfitting in neural networks by preventing complex
co-adaptations of neurons.
Method: during training, random subsets of the neurons are
removed from the network (they are "dropped out"). This is done by
artificially setting the activations of those neurons to zero.
Whether a given unit/neuron is dropped out or not is completely
independent of the other units.
If the network has N (input/hidden) units, applying dropout to these
units can result in 2N possible ’subnetworks’.
Because these subnetworks are derived from the same ’parent’
network, many of the weights are shared.
Dropout can be seen as a form of "model averaging".

Deep Learning – 5 / 22
DROPOUT

Deep Learning – 6 / 22
DROPOUT

In each iteration, for each training example (in the forward pass), a
different (random) subset of neurons is dropped out.

Deep Learning – 7 / 22
DROPOUT: ALGORITHM
To train with dropout a minibatch-based learning algorithm such as
stochastic gradient descent is used.
For each training case in a minibatch, we randomly sample a
binary vector/mask µ with one entry for each input or hidden unit in
the network. The entries of µ are sampled independently from
each other.
The probability of sampling a mask value of 0 (dropout) for one
unit is a hyperparameter known as the ’dropout rate’.
A typical value for the dropout rate is 0.2 for input units and 0.5 for
hidden units.
Each unit in the network is multiplied by the corresponding mask
value resulting in a subnetµ .
Forward propagation, backpropagation, and the learning update
are run as usual.

Deep Learning – 10 / 22
DROPOUT: ALGORITHM
Algorithm 1 Training a (parent) neural network with dropout rate p
1: Define parent network and initialize weights
2: for each minibatch: do
3: for each training sample: do
4: Draw mask µ using p
5: Compute forward pass for subnetµ
6: end for
7: Update the weights of the (parent) network by performing a gradient descent step
with weight decay
8: end for

The derivatives wrt. each parameter are averaged over the training
cases in each mini-batch. Any training case which does not use a
parameter contributes a gradient of zero for that parameter.

Deep Learning – 11 / 22
DROPOUT: WEIGHT SCALING
The weights of the network will be larger than normal because of
dropout. Therefore, to obtain a prediction at test time the weights
must be first scaled by the chosen dropout rate.
This means that if a unit (neuron) is retained with probability p
during training, the weight of that unit is multiplied by p at test time.

Credit: Srivastava et. al. (2014)

Weight scaling ensures that the expected total input to a
neuron/unit at test time is roughly the same as the expected total
input to that unit at train time, even though many of the units at
train time were missing on average.

Deep Learning – 12 / 22
DROPOUT: WEIGHT SCALING
Rescaling of the weights can also be performed at training time
instead, after each weight update at the end of the mini-batch.
This is sometimes called ’inverse dropout’. Keras and PyTorch
deep learning libraries implement dropout in this way.
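A minimal NumPy sketch of inverted dropout for one layer (the rescaling by 1/(1 − dropout rate) happens at training time, so nothing needs to be rescaled at test time; the function and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, dropout_rate=0.5, training=True):
    """Inverted dropout: zero out units at random and rescale the survivors during training."""
    if not training or dropout_rate == 0.0:
        return activations                      # at test time the layer is left untouched
    keep_prob = 1.0 - dropout_rate
    mask = rng.binomial(1, keep_prob, size=activations.shape)   # mu: 1 = keep, 0 = drop
    return activations * mask / keep_prob       # rescale so the expected total input is unchanged

z = np.array([0.2, -1.3, 0.7, 2.1])
print(dropout_forward(z, dropout_rate=0.5))
```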

Deep Learning – 13 / 22
DROPOUT: EXAMPLE

To demonstrate how dropout can easily improve generalization, we
train neural networks with the structure shown on the right.
Each neural network we fit has different dropout probabilities:
a tuple where one probability is for the input layer and one is
for the hidden layers. We consider the tuples
(0; 0), (0.2; 0.2) and (0.6; 0.5).

Deep Learning – 14 / 22
DROPOUT: EXAMPLE
Figure: Test error vs. epochs (150 to 200) for dropout rates (input; hidden layers) of (0;0), (0.2;0.2) and (0.6;0.5).

Dropout rate of 0 (no dropouts) leads to higher test error than dropping
some units out.

Deep Learning – 15 / 22
DROPOUT, WEIGHT DECAY OR BOTH?
Figure: Test error vs. epochs (0 to 500) comparing four settings: unregularized, dropout, weight decay, and dropout + weight decay.

Here, dropout leads to a smaller test error than using no regularization or solely weight decay.

Deep Learning – 16 / 22
Dataset Augmentation

Deep Learning – 17 / 22
DATASET AUGMENTATION
Problem: low generalization because of a high ratio of
(complexity of the model) / (#train data)
Idea: artificially increase the train data.
Limited data supply → create “fake data”!
Increase variation in inputs without changing the labels.
Application:
Image and Object recognition (rotation, scaling, pixel
translation, flipping, noise injection, vignetting, color casting,
lens distortion, injection of random negatives)
Speech recognition (speed augmentation, vocal tract
perturbation)
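A minimal NumPy/SciPy sketch of a few such label-preserving perturbations (flip, small rotation, noise injection); which transformations are actually safe depends on the task, as the digit example on the following slides shows:

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly perturbed copy of a 2D grayscale image (label unchanged)."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                               # horizontal flip
    angle = rng.uniform(-10, 10)                           # small random rotation in degrees
    out = rotate(out, angle, reshape=False, mode="nearest")
    out = out + rng.normal(0.0, 0.01, size=out.shape)      # mild noise injection
    return np.clip(out, 0.0, 1.0)

image = rng.random((28, 28))                     # stand-in for a real training image
augmented = [augment(image) for _ in range(5)]   # five "fake" variants of the same example
```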

Deep Learning – 18 / 22
DATASET AUGMENTATION

Figure: (Wu et al. (2015))

Deep Learning – 19 / 22
DATASET AUGMENTATION

Figure: (Wu et al. (2015))

⇒ careful when rotating digits (6 will become 9 and vice versa)!

Deep Learning – 20 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2012)
Improving neural networks by preventing co-adaptation of feature detectors
http://arxiv.org/abs/1207.0580
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan
Salakhutdinov (2014)
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
http://jmlr.org/papers/v15/srivastava14a.html
Wu Ren, Yan Shengen, Shan Yi, Dang Qingqing and Sun Gang (2015)
Deep Image: Scaling up Image Recognition
https://arxiv.org/abs/1501.02876

Deep Learning – 21 / 22
Deep Learning

Challenges in Optimization

Learning goals
Ill-Conditioning
Local Minima
Saddle Points
Cliffs and Exploding Gradients
CHALLENGES IN OPTIMIZATION
In this section, we summarize several of the most prominent
challenges regarding training of deep neural networks.
Traditionally, machine learning ensures that the optimization
problem is convex by carefully designing the objective function and
constraints. But for neural networks we are confronted with the
general nonconvex case.
Furthermore, we will see in this section that even convex
optimization is not without its complications.

Deep Learning – 1 / 34
Ill-Conditioning

Deep Learning – 2 / 34
EFFECTS OF CURVATURE
Intuitively, the curvature of a function determines the outcome of a GD
step. . .

Source: Goodfellow et al., (2016), ch. 4

Figure: Quadratic objective function f (x) with various curvatures. The dashed line
indicates the first order taylor approximation based on the gradient information alone.
Left: With negative curvature, the cost function decreases faster than the gradient
predicts; Middle: With no curvature, the gradient predicts the decrease correctly; Right:
With positive curvature, the function decreases more slowly than expected and begins
to increase.

Deep Learning – 3 / 34
SECOND DERIVATIVE AND CURVATURE
To understand better how the curvature of a function influences the
outcome of a gradient descent step, let us recall how curvature is
described mathematically:
The second derivative corresponds to the curvature of the graph of
a function.
The Hessian matrix of a function R(θ) : Rm → R is the matrix of
second-order partial derivatives

H_ij = ∂²R(θ) / (∂θ_i ∂θ_j).

Deep Learning – 4 / 34
SECOND DERIVATIVE AND CURVATURE
The second derivative in a direction d, with ∥d∥ = 1, is given by
d⊤H d.
What is the direction of the highest curvature (red direction), and
what is the direction of the lowest curvature (blue)?

Deep Learning – 5 / 34
SECOND DERIVATIVE AND CURVATURE
Since H is real and symmetric (why?), eigendecomposition yields
H = Vdiag(λ)V−1 with V and λ collecting eigenvectors and
eigenvalues, respectively.
It can be shown that the eigenvector v_max with the max.
eigenvalue λ_max points into the direction of highest curvature
(v_maxᵀ H v_max = λ_max), while the eigenvector v_min with the min.
eigenvalue λ_min points into the direction of least curvature.
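A small NumPy sketch of the directional curvature d⊤Hd for a made-up symmetric matrix H, confirming that the eigenvectors of H give the directions of extreme curvature:

```python
import numpy as np

H = np.array([[5.0, 1.0],
              [1.0, 0.5]])                  # made-up symmetric "Hessian"

eigvals, eigvecs = np.linalg.eigh(H)        # ascending eigenvalues, orthonormal eigenvectors
v_min, v_max = eigvecs[:, 0], eigvecs[:, -1]

curvature = lambda d: d @ H @ d / (d @ d)   # second derivative in direction d
print(curvature(v_max), eigvals[-1])        # highest curvature equals lambda_max
print(curvature(v_min), eigvals[0])         # lowest curvature equals lambda_min
```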

Deep Learning – 6 / 34
SECOND DERIVATIVE AND CURVATURE
At a stationary point θ , where the gradient is 0, we can examine
the eigenvalues of the Hessian to determine whether the θ is a
local maximum, minimum or saddle point:
∀i : λi > 0 (H positive definite at θ) ⇒ minimum at θ
∀i : λi < 0 (H negative definite at θ) ⇒ maximum at θ
∃i : λi < 0 ∧ ∃j : λj > 0 (H indefinite at θ) ⇒ saddle point at θ

Credit: Rong Ge (2016)

Deep Learning – 7 / 34
ILL-CONDITIONED HESSIAN MATRIX
The condition number of a symmetric matrix A is given by the ratio of its
max/min eigenvalues, κ(A) = |λ_max| / |λ_min|. A matrix is called ill-conditioned if
the condition number κ(A) is very high.

An ill-conditioned Hessian matrix means that the ratio of max. / min.
curvature is high, as in the example below:

Deep Learning – 8 / 34
CURVATURE AND STEP-SIZE IN GD
What does it mean for gradient descent if the Hessian is ill-conditioned?
Let us consider the second-order Taylor approximation as a local
approximation of R around a current point θ⁰ (with gradient g):

R(θ) ≈ T₂f(θ, θ⁰) := R(θ⁰) + (θ − θ⁰)⊤ g + ½ (θ − θ⁰)⊤ H (θ − θ⁰)
Furthermore, Taylor’s theorem states (proof in Koenigsberger
(1997), p. 68)
lim_{θ → θ⁰}  (R(θ) − T₂f(θ, θ⁰)) / ||θ − θ⁰||² = 0

Deep Learning – 9 / 34
CURVATURE AND STEP-SIZE IN GD
One GD step with a learning rate α yields new parameters
θ⁰ − αg and a new approximated loss value

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg.

Theoretically, if g⊤Hg is positive, we can solve the equation above
for the optimal step size, which corresponds to

α* = g⊤g / (g⊤Hg).

Deep Learning – 10 / 34
CURVATURE AND STEP-SIZE IN GD
If the gradient g points into the direction of v_max (i.e.
the direction of highest curvature), the optimal step size is given by

α* = g⊤g / (g⊤Hg) = g⊤g / (λ_max g⊤g) = 1 / λ_max,
which is very small. Choosing a too large step-size is bad, as it
will make us “overshoot” the stationary point.
If, on the other hand, g points into the direction of the lowest
curvature, the optimal step size is

α* = 1 / λ_min,
which corresponds to the largest possible optimal step-size.
We summarize: We want to perform big steps in directions of low
curvature, but small steps in directions of high curvature.

Deep Learning – 11 / 34
CURVATURE AND STEP-SIZE IN GD
But what if the gradient does not point into the direction of one of
the eigenvectors?
Let us consider the 2-dimensional case: We can decompose the
direction of g (black) into the two eigenvectors vmax and vmin
It would be optimal to perform a big step into the direction of the
smallest curvature vmin , but a small step into the direction of vmax ,
but the gradient points into a completely different direction.

Deep Learning – 12 / 34
ILL-CONDITIONING
GD is unaware of large differences in curvature, and can only walk
into the direction of the gradient.
Choosing a too large step-size will then cause the descent
direction to change frequently (“jumping around”).
α needs to be small enough, which results in a low progress.

Deep Learning – 13 / 34
ILL-CONDITIONING
This effect is more severe, if a Hessian has a poor condition
number, i.e. the ratio between lowest and highest curvature is
large; gradient descent will perform poorly.

Figure: The contour lines show a quadratic risk function with a poorly conditioned
Hessian matrix. The plot shows the progress of gradient descent with a small step-size
vs. larger step-size. In both cases, convergence to the global optimum is rather slow.

Deep Learning – 14 / 34
ILL-CONDITIONING
In the worst case, ill-conditioning of the Hessian matrix and a too
big step-size will cause the risk to increase

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg,

which happens if

½ α² g⊤Hg > α g⊤g.
To determine whether ill-conditioning is detrimental to the training,
the squared gradient norm g⊤ g and the risk can be monitored.

Deep Learning – 15 / 34
ILL-CONDITIONING

Source: Goodfellow, ch. 6

Gradient norms increase over time, showing that the training
process is not converging to a stationary point g = 0.
At the same time, we observe that the risk stays approximately constant
while the gradient norm increases:

R(θ⁰ − αg) ≈ R(θ⁰) − α g⊤g + ½ α² g⊤Hg

Here the left-hand side is approx. constant and g⊤g increases, so the
curvature term ½ α² g⊤Hg must increase as well.

Deep Learning – 16 / 34
Local Minima

Deep Learning – 17 / 34
UNIMODAL VS. MULTIMODAL LOSS SURFACES

Figure: Left: Multimodal loss surface with saddle points; Right: (Nearly)
unimodal loss surface (Hao Li et al. (2017))

Deep Learning – 18 / 34
MULTIMODAL FUNCTION
Potential snippet from a loss surface of a deep neural network with
many local minima:

Deep Learning – 19 / 34
ONLY LOCALLY OPTIMAL MOVES
If the training algorithm makes only locally optimal moves (as in gradient
descent), it may move away from regions of much lower cost.

Source: Goodfellow, Ch. 8

In the figure above, initializing the parameter on the "wrong" side of the
hill will result in suboptimal performance.
In higher dimensions, however, it may be possible for gradient descent to
go around the hill but such a trajectory might be very long and result in
excessive training time.

Deep Learning – 20 / 34
LOCAL MINIMA
Weight space symmetry:
If we swap the incoming weight vectors of neurons i and j and do the
same for the outgoing weights, the modelled function stays
unchanged.
⇒ with n hidden units and one hidden layer there are n!
networks with the same empirical risk
If we multiply the incoming weights of a ReLU neuron by β and
the outgoing ones by 1/β, the modelled function stays unchanged.
⇒ The empirical risk of a NN can have very many minima with
equivalent empirical risk.

Deep Learning – 21 / 34
LOCAL MINIMA
In practice only local minima with a high value compared to the
global minimium are problematic.

Source: Goodfellow, Ch. 4

Current literature suspects that most local minima have low


empirical risk.
Simple test: Norm of gradient should get close to zero.

Deep Learning – 22 / 34
Saddle Points

Deep Learning – 23 / 34
SADDLE POINTS
In optimization we look for areas with zero gradient.
A variant of zero gradient areas are saddle points.
For the empirical risk R : ℝ^m → ℝ of a neural network, the expected ratio of
the number of saddle points to local minima typically grows
exponentially with m.
In other words: Networks with more parameters (deeper networks
or larger layers) exhibit a lot more saddle points than local minima.
Why is that?
The Hessian at a local minimum has only positive eigenvalues. At
a saddle point it is a mixture of positive and negative eigenvalues.

Deep Learning – 24 / 34
SADDLE POINTS
Imagine the sign of each eigenvalue is generated by coin flipping:
In a single dimension, it is easy to obtain a local minimum
(e.g. “head” means positive eigenvalue).
In an m-dimensional space, it is exponentially unlikely that all
m coin tosses will be head.
A property of many random functions is that eigenvalues of the
Hessian become more likely to be positive in regions of lower cost.
For the coin flipping example, this means we are more likely to
have heads m times if we are at a critical point with low cost.
That means in particular that local minima are much more likely to
have low cost than high cost and critical points with high cost are
far more likely to be saddle points.
See Dauphin et al. (2014) for a more detailed investigation.

Deep Learning – 25 / 34
SADDLE POINTS
“Saddle points are surrounded by high error plateaus that can
dramatically slow down learning, and give the illusory impression
of the existence of a local minimum” (Dauphin et al. (2014)).

Deep Learning – 26 / 34
SADDLE POINTS: EXAMPLE

f(x1, x2) = x1² − x2²

Along x1, the function curves upwards (eigenvector of the Hessian
with positive eigenvalue). Along x2, the function curves downwards
(eigenvector of the Hessian with negative eigenvalue).
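A tiny gradient descent sketch on this function: started exactly on the x1 axis, plain GD converges to the saddle point at the origin and stays there (any nonzero x2 component would eventually let it escape):

```python
import numpy as np

grad = lambda x: np.array([2 * x[0], -2 * x[1]])   # gradient of f(x1, x2) = x1^2 - x2^2

x = np.array([1.0, 0.0])      # start on the x1 axis (x2 component exactly zero)
for _ in range(100):
    x = x - 0.1 * grad(x)     # plain gradient descent
print(x)                      # approx. [0, 0]: stuck at the saddle point
```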

Deep Learning – 27 / 34
SADDLE POINTS
So how do saddle points impair optimization?
First-order algorithms that use only gradient information might get
stuck in saddle points.
Second-order algorithms experience even greater problems when
dealing with saddle points. Newtons method for example actively
searches for a region with zero gradient. That might be another
reason why second-order methods have not succeeded in
replacing gradient descent for neural network training.

Deep Learning – 28 / 34
EXAMPLE: SADDLE POINT WITH GD

Red dot: Starting location

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

First step...

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

...second step...

Deep Learning – 29 / 34
EXAMPLE: SADDLE POINT WITH GD

...tenth step got stuck and cannot escape the saddle point!

Deep Learning – 29 / 34
Cliffs and Exploding Gradients

Deep Learning – 30 / 34
CLIFFS AND EXPLODING GRADIENTS
As a result of the multiplication of several parameters, the
empirical risk for highly nonlinear deep neural networks often
contains sharp nonlinearities.
That may result in very high derivatives in some places.
As the parameters get close to such cliff regions, a gradient
descent update can catapult the parameters very far.
Such an occurrence can lead to losing most of the
optimization work that had been done.
However, serious consequences can be easily avoided using a
technique called gradient clipping.
The gradient does not specify the optimal step size, but only the
optimal direction within an infinitesimal region.

Deep Learning – 31 / 34
CLIFFS AND EXPLODING GRADIENTS
Gradient clipping simply caps the step size to be small enough that
it is less likely to go outside the region where the gradient indicates
the direction of steepest descent.
We simply “prune” the norm of the gradient at some threshold h:

if ||∇θ|| > h :   ∇θ ← (h / ||∇θ||) · ∇θ
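A one-function NumPy sketch of this clipping rule:

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient so that its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])           # a "cliff" gradient with norm 50
print(clip_gradient(g, threshold=5))  # [ 3. -4.], norm capped at 5
```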

Deep Learning – 32 / 34
EXAMPLE: CLIFFS AND EXPLODING GRADIENTS

Figure: “The objective function for highly nonlinear deep neural networks or
for recurrent neural networks often contains sharp nonlinearities in parameter
space resulting from the multiplication of several parameters. These
nonlinearities give rise to very high derivatives in some places. When the
parameters get close to such a cliff region, a gradient descent update can
catapult the parameters very far, possibly losing most of the optimization work
that had been done” (Goodfellow et al. (2016)).

Deep Learning – 33 / 34
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya
Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Konrad Koenigsberger (1997)
Analysis 2, Springer

Rong Ge (2016)
Escaping from Saddle Points
http://www.offconvex.org/2016/03/22/saddlepoints/

Deep Learning – 34 / 34
Deep Learning

Advanced Optimization

Learning goals
SGD with Momentum
Learning Rate Schedules
Adaptive Learning Rates
Batch Normalization
Momentum

Deep Learning – 1 / 47
MOMENTUM
While SGD remains a popular optimization strategy, learning with it
can sometimes be slow.
Momentum is designed to accelerate learning, especially when
facing high curvature, small but consistent or noisy gradients.
Momentum accumulates an exponentially decaying moving
average of past gradients:
ν ← φν − α [ (1/m) Σ_i ∇θ L(y^(i), f(x^(i), θ)) ]
θ ← θ + ν

We introduce a new hyperparameter φ ∈ [0, 1), determining how
quickly the contributions of previous gradients decay.
ν is called “velocity” and derives from a physical analogy
describing how particles move through a parameter space
(Newton’s law of motion).
Deep Learning – 2 / 47
MOMENTUM
So far the step size was simply the gradient g multiplied by the
learning rate α.
Now, the step size depends on how large and how aligned a
sequence of gradients is. The step size grows when many
successive gradients point in the same direction.
Common values for φ are 0.5, 0.9 and even 0.99.
Generally, the larger φ is relative to the learning rate α, the more
previous gradients affect the current direction.
A very good website with an in-depth analysis of momentum:
https://distill.pub/2017/momentum/

Deep Learning – 3 / 47
MOMENTUM: EXAMPLE

ν1 ← φν0 − αg (θ[0] )
θ[1] ← θ[0] + φν0 − αg (θ[0] )
ν2 ← φν1 − αg (θ[1] )
= φ(φν0 − αg (θ[0] )) − αg (θ[1] )
θ[2] ← θ[1] + φ(φν0 − αg (θ[0] )) − αg (θ[1] )
ν3 ← φν2 − αg (θ[2] )
= φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
θ[3] ← θ[2] + φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= θ[2] + φ3 ν0 − φ2 αg (θ[0] ) − φαg (θ[1] ) − αg (θ[2] )
= θ[2] − α(φ2 g (θ[0] ) + φ1 g (θ[1] ) + φ0 g (θ[2] )) + φ3 ν0
θ[t+1] = θ[t] − α Σ_{j=0}^t φ^j g(θ[t−j]) + φ^(t+1) ν0

Deep Learning – 4 / 47
MOMENTUM: EXAMPLE
Suppose momentum always observes the same gradient g (θ):
θ[t+1] = θ[t] − α Σ_{j=0}^t φ^j g(θ[t−j]) + φ^(t+1) ν0
       = θ[t] − α g(θ) Σ_{j=0}^t φ^j + φ^(t+1) ν0
       = θ[t] − α g(θ) (1 − φ^(t+1)) / (1 − φ) + φ^(t+1) ν0
       → θ[t] − α g(θ) / (1 − φ)   for t → ∞.

Thus, momentum will accelerate in the direction of −g (θ) until reaching terminal
velocity with step size:

−α g(θ)(1 + φ + φ² + φ³ + ...) = −α g(θ) / (1 − φ)

E.g. a momentum with φ = 0.9 corresponds to multiplying the maximum speed by 10


relative to the gradient descent algorithm.

Deep Learning – 5 / 47
MOMENTUM: ILLUSTRATION
The vector ν3 (for ν0 = 0):
ν3 = φ(φ(φν0 − αg (θ[0] )) − αg (θ[1] )) − αg (θ[2] )
= −φ2 (αg (θ[0] )) − φ(αg (θ[1] )) − αg (θ[2] )

Figure: If consecutive (negative) gradients point mostly in the same direction, the
velocity "builds up". On the other hand, if consecutive (negative) gradients point in very
different directions, the velocity "dies down".

Deep Learning – 6 / 47
SGD WITH MOMENTUM

Algorithm Stochastic gradient descent with momentum


1: require learning rate α and momentum φ
2: require initial parameter θ and initial velocity ν
3: while stopping criterion not met do
4: Sample a minibatch of m examples from the training set {x̃^(1), . . . , x̃^(m)}
5: Compute gradient estimate: ĝ ← (1/m) Σ_i ∇θ L(y^(i), f(x̃^(i) | θ))
6: Compute velocity update: ν ← φν − αĝ
7: Apply update: θ ← θ + ν
8: end while
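A minimal NumPy sketch of this algorithm; grad_fn stands in for the minibatch gradient estimate, and the quadratic toy objective is made up:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, alpha=0.01, phi=0.9, n_steps=100):
    """grad_fn(theta) should return the (minibatch) gradient estimate g_hat."""
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        g_hat = grad_fn(theta)                        # gradient estimate
        velocity = phi * velocity - alpha * g_hat     # decaying average of past gradients
        theta = theta + velocity                      # apply update
    return theta

# toy example: minimize a poorly conditioned quadratic 0.5 * theta' H theta
H = np.diag([100.0, 1.0])
print(sgd_momentum(lambda th: H @ th, theta=np.array([1.0, 1.0])))
```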

Deep Learning – 7 / 47
SGD WITH MOMENTUM

Figure: The contour lines show a quadratic loss function with a poorly conditioned
Hessian matrix. The two curves show how standard gradient descent (black) and
momentum (red) learn when dealing with ravines. Momentum reduces the oscillation
and accelerates the convergence.

Deep Learning – 8 / 47
SGD WITH AND WITHOUT MOMENTUM
The following plot was created by our Shiny App. On the upper left you can explore
different predefined examples. Click here

Figure: Comparison of SGD with and without momentum on the Styblinkski-Tang


function. The black dot on the bottom left is the global optimum. We can see that SGD
without momentum (red line/points) cannot escape the local minimum, while SGD with
momentum (blue line/dots) is able to escape the local minimum and finds the global
minimum.

Deep Learning – 9 / 47
MOMENTUM IN PRACTICE

Let's try out different values of momentum (with SGD) on the
MNIST data.
We apply the same architecture we have used a dozen times
already (note that we used φ = 0.9 in all computations so far,
i.e. in chapters 1 and 2)!

Deep Learning – 10 / 47
MOMENTUM IN PRACTICE

The higher the momentum, the faster SGD learns the weights on the training data,
but if the momentum is too large, the training and test error fluctuate.

Deep Learning – 11 / 47
NESTEROV MOMENTUM
Momentum aims to solve poor conditioning of the Hessian but also
variance in the stochastic gradient.
Nesterov momentum modifies the algorithm such that the gradient
is evaluated after the current velocity is applied:
ν ← φν − α [ (1/m) Σ_i ∇θ L(y^(i), f(x^(i), θ + φν)) ]
θ ← θ + ν

We can interpret Nesterov momentum as an attempt to add a


correction factor to the basic method.
The method is also called Nesterov accelerated gradient (NAG).

Deep Learning – 13 / 47
SGD WITH NESTEROV MOMENTUM

Algorithm Stochastic gradient descent with Nesterov momentum


1: require learning rate α and momentum φ
2: require initial parameter θ and initial velocity ν
3: while stopping criterion not met do
4: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
5: Apply interim update: θ̃ ← θ + φν
6: Compute gradient estimate: ĝ ← (1/m) Σ_i ∇θ̃ L(y^(i), f(x^(i), θ̃))
7: Compute velocity update: ν ← φν − αĝ
8: Apply update: θ ← θ + ν
9: end while

Deep Learning – 14 / 47
MOMENTUM VS. NESTEROV MOMENTUM

Credits: Chandra (2015)

Figure: Comparison GD with momentum (left) and GD with Nesterov momentum


(right) for one parameter θ . The first three updates of θ are very similar in both cases
and the updates become larger due to momentum (accumulation of previous negative
gradients). Update 4 is different. In case of momentum, the update overshoots as it
makes an even bigger step due to the gradient history. In contrast, Nesterov
momentum first evaluates a "look-ahead" point θlook_ahead , detects that it overshoots,
and slightly reduces the overall magnitude of the fourth update. Thus, Nesterov
momentum reduces overshooting and leads to smaller oscillations than momentum.

Deep Learning – 15 / 47
Learning Rates

Deep Learning – 16 / 47
LEARNING RATE
The learning rate is a very important hyperparameter.
To systematically find a good learning rate, we can start at a very
low learning rate and gradually increase it (linearly or
exponentially) after each mini-batch.
We can then plot the learning rate and the training loss for each
batch.
A good learning rate is one that results in a steep decline in the
loss.

Credit: jeremyjordan

Deep Learning – 17 / 47
LEARNING RATE SCHEDULE
We would like to force convergence until reaching a local minimum.
Applying SGD, we have to decrease the learning rate over time,
thus α[t ] (learning rate at training iteration t).
The estimator ĝ is computed based on small batches.
Randomly sampling m training samples introduces noise that
does not vanish even if we find a minimum.
In practice, a common strategy is to decay the learning rate
linearly over time until iteration τ :
α^[t] = (1 − t/τ) α^[0] + (t/τ) α^[τ]   for t ≤ τ
α^[t] = α^[τ]                           for t > τ

Deep Learning – 18 / 47
LEARNING RATE SCHEDULE
Example for τ = 4:

iteration t    t/τ     α[t]
1              0.25    (1 − 1/4) α[0] + (1/4) α[τ] = (3/4) α[0] + (1/4) α[τ]
2              0.5     (2/4) α[0] + (2/4) α[τ]
3              0.75    (1/4) α[0] + (3/4) α[τ]
4              1       0 + α[τ] = α[τ]
...                    α[τ]
t + 1                  α[τ]
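A small sketch of this schedule; plugging in τ = 4 reproduces the rows of the table above:

def lr_linear_decay(t, tau, alpha_0, alpha_tau):
    """Linearly decay the learning rate from alpha_0 to alpha_tau until iteration tau."""
    if t > tau:
        return alpha_tau
    return (1 - t / tau) * alpha_0 + (t / tau) * alpha_tau

# e.g. tau = 4: t = 1 gives 0.75*alpha_0 + 0.25*alpha_tau, t >= 4 gives alpha_tau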

Deep Learning – 19 / 47
CYCLICAL LEARNING RATES
Another option is to have a learning rate that periodically varies
according to some cyclic function.
The idea: if training no longer improves the loss (possibly because it is stuck near a saddle point), increasing the learning rate makes it possible to rapidly traverse such regions.
Recall, saddle points are far more likely than local minima in deep
nets.
Each cycle has a fixed length in terms of the number of iterations.

Deep Learning – 20 / 47
CYCLICAL LEARNING RATES
One such cyclical function is the "triangular" function.

Credit: Hafidz Zulkifli

In the right image, the range is cut in half after each cycle.
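A sketch of the triangular schedule, with step_size (iterations per half-cycle) and the bounds lr_min, lr_max as illustrative choices; dividing (lr_max − lr_min) by 2**cycle gives the halved-range variant:

def triangular_lr(iteration, step_size, lr_min, lr_max):
    """Triangular cyclical learning rate: linear ramp up and down within each cycle."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # 1 at the cycle borders, 0 at the peak
    return lr_min + (lr_max - lr_min) * (1 - x)
    # halved-range variant: lr_min + (lr_max - lr_min) * (1 - x) / 2 ** cycle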

Deep Learning – 21 / 47
CYCLICAL LEARNING RATES
Yet another option is to abruptly "restart" the learning rate after a
fixed number of iterations.
Loshchilov et al. (2016) proposed "cosine annealing" (between
restarts).
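A sketch of cosine annealing with restarts for a fixed cycle length (SGDR also allows the period to grow between restarts, which is omitted here):

import math

def cosine_annealing_lr(t, period, lr_min, lr_max):
    """Cosine decay within each cycle; the rate jumps back to lr_max at every restart."""
    t_cur = t % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / period))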

Credit: Hafidz Zulkifli

Deep Learning – 22 / 47
Algorithms with Adaptive Learning Rates

Deep Learning – 23 / 47
ADAPTIVE LEARNING RATES
The learning rate is reliably one of the hyperparameters that is the most difficult to set because it has a significant impact on the model's performance.
Naturally, it might make sense to use a different learning rate for
each parameter, and automatically adapt them throughout the
training process.

Deep Learning – 24 / 47
ADAGRAD
Adagrad adapts the learning rate to the parameters.
In fact, Adagrad scales learning rates inversely proportional to the
square root of the sum of the past squared derivatives.
Parameters with large partial derivatives of the loss obtain a
rapid decrease in their learning rate.
Parameters with small partial derivatives on the other hand
obtain a relatively small decrease in their learning rate.
For that reason, Adagrad might be well suited when dealing with
sparse data.
Goodfellow et al. (2016) note that the accumulation of squared gradients can result in a premature and overly large decrease in the learning rate.

Deep Learning – 25 / 47
ADAGRAD
Algorithm Adagrad
1: require Global learning rate α
2: require Initial parameter θ
3: require Small constant β , perhaps 10−7 , for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
7: Compute gradient estimate: ĝ ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
8: Accumulate squared gradient r ← r + ĝ ⊙ ĝ
9: Compute update: ∇θ = − (α / (β + √r)) ⊙ ĝ (division and square root applied element-wise)
10: Apply update: θ ← θ + ∇θ
11: end while

"⊙" is called the Hadamard or element-wise product.

Example:

A = [ 1 2 ; 3 4 ],  B = [ 5 6 ; 7 8 ],  then  A ⊙ B = [ 1·5  2·6 ; 3·7  4·8 ] = [ 5  12 ; 21  32 ]
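A minimal NumPy sketch of one Adagrad update on a parameter vector; element-wise multiplication and np.sqrt implement ⊙ and the element-wise square root:

import numpy as np

def adagrad_step(theta, r, g_hat, alpha=0.01, beta=1e-7):
    """One Adagrad update; r accumulates the squared gradients over all past steps."""
    r = r + g_hat * g_hat                           # r <- r + g ⊙ g
    delta = -(alpha / (beta + np.sqrt(r))) * g_hat  # per-parameter scaled step
    return theta + delta, r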

Deep Learning – 26 / 47
RMSPROP
RMSprop is a modification of Adagrad.
Its intention is to resolve Adagrad’s radically diminishing learning rates.
The gradient accumulation is replaced by an exponentially
weighted moving average.
Theoretically, that leads to performance gains in non-convex
scenarios.
Empirically, RMSProp is a very effective optimization algorithm.
Particularly, it is employed routinely by deep learning practitioners.

Deep Learning – 27 / 47
RMSPROP
Algorithm RMSProp
1: require Global learning rate α and decay rate ρ ∈ [0, 1)
2: require Initial parameter θ
3: require Small constant β , perhaps 10−6 , for numerical stability
4: Initialize gradient accumulation variable r = 0
5: while stopping criterion not met do
6: Sample a minibatch of m examples from the training set {x̃(1), . . . , x̃(m)}
7: Compute gradient estimate: ĝ ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
8: Accumulate squared gradient: r ← ρr + (1 − ρ) ĝ ⊙ ĝ
9: Compute update: ∇θ = − (α / (β + √r)) ⊙ ĝ
10: Apply update: θ ← θ + ∇θ
11: end while

Deep Learning – 28 / 47
ADAM
Adaptive Moment Estimation (Adam) is another method that
computes adaptive learning rates for each parameter.
Adam uses the first and the second moments of the gradients.
Adam keeps an exponentially decaying average of past
gradients (first moment).
Like RMSProp it stores an exponentially decaying average of
past squared gradients (second moment).
Thus, it can be seen as a combination of RMSProp and
momentum.
Basically Adam uses the combined averages of previous gradients
at different moments to give it more “persuasive power” to
adaptively update the parameters.

Deep Learning – 29 / 47
ADAM
Algorithm Adam
1: require Step size α (suggested default: 0.001)
2: require Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested de-
faults: 0.9 and 0.999 respectively)
3: require Small constant β (suggested default 10−8 )
4: require Initial parameters θ
5: Initialize time step t = 0
6: Initialize 1st and 2nd moment variables s[0] = 0, r[0] = 0
7: while stopping criterion not met do
8: t ←t +1
9: Sample a minibatch of m examples from the training set {x̃ (1) , . . . , x̃ (m) }
10: Compute gradient estimate: ĝ[t] ← (1/m) ∇θ Σᵢ L(y⁽ⁱ⁾, f(x̃⁽ⁱ⁾ | θ))
11: Update biased first moment estimate: s[t] ← ρ1 s[t−1] + (1 − ρ1) ĝ[t]
12: Update biased second moment estimate: r[t] ← ρ2 r[t−1] + (1 − ρ2) ĝ[t] ⊙ ĝ[t]
13: Correct bias in first moment: ŝ ← s[t] / (1 − ρ1^t)
14: Correct bias in second moment: r̂ ← r[t] / (1 − ρ2^t)
15: Compute update: ∇θ = −α ŝ / (√r̂ + β)
16: Apply update: θ ← θ + ∇θ
17: end while
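A minimal NumPy sketch of one Adam step on a parameter vector (t is assumed to start at 1 so that the bias correction is well defined):

import numpy as np

def adam_step(theta, s, r, t, g_hat, alpha=0.001, rho1=0.9, rho2=0.999, beta=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    s = rho1 * s + (1 - rho1) * g_hat             # biased first moment
    r = rho2 * r + (1 - rho2) * g_hat * g_hat     # biased second moment
    s_hat = s / (1 - rho1 ** t)                   # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + beta)
    return theta, s, r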

Deep Learning – 30 / 47
ADAM
Adam initializes the exponentially weighted moving averages s
and r as 0 (zero) vectors.
As a result, they are biased towards zero.
This means E[s[t ] ] ̸= E[ĝ[t ] ] and E[r[t ] ] ̸= E[ĝ[t ] ⊙ ĝ[t ] ] (where the
expectations are calculated over minibatches).
To see this, let us unroll the computation of s[t ] for a few
time-steps:
s[0] = 0
s[1] = ρ1 s[0] + (1 − ρ1) ĝ[1] = (1 − ρ1) ĝ[1]
s[2] = ρ1 s[1] + (1 − ρ1) ĝ[2] = ρ1 (1 − ρ1) ĝ[1] + (1 − ρ1) ĝ[2]
s[3] = ρ1 s[2] + (1 − ρ1) ĝ[3] = ρ1² (1 − ρ1) ĝ[1] + ρ1 (1 − ρ1) ĝ[2] + (1 − ρ1) ĝ[3]

Therefore, s[t] = (1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) ĝ[i].
Note that the contribution of the earlier ĝ[i ] to the moving average
shrinks rapidly.

Deep Learning – 31 / 47
ADAM
The expected value of s[t] is:

E[s[t]] = E[(1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) ĝ[i]]
        = E[ĝ[t]] (1 − ρ1) Σᵢ₌₁ᵗ ρ1^(t−i) + ζ
        = E[ĝ[t]] (1 − ρ1^t) + ζ

where we approximate ĝ[i] with ĝ[t], which allows us to move it outside the sum. ζ is the error that results from this approximation.
Therefore, s[t] is a biased estimator of ĝ[t] and the effect of the bias vanishes over the time-steps (because ρ1^t → 0 for t → ∞).
Ignoring ζ (as it can be kept small), we correct for the bias by setting ŝ[t] = s[t] / (1 − ρ1^t).
Similarly, we set r̂[t] = r[t] / (1 − ρ2^t).

Deep Learning – 32 / 47
COMPARISON OF OPTIMIZERS: ANIMATION

Credits: Dettmers (2015) and Radford

Figure: Excerpts from an animation comparing the behavior of momentum and other methods to SGD near a saddle point. Left: after a few seconds; right: a bit later. The animation shows that all shown methods accelerate optimization compared to standard SGD. The highest acceleration is obtained using RMSProp, followed by Adagrad, as learning rate strategies. You can find the animation here or click on the images above.

Deep Learning – 33 / 47
Batch Normalization

Deep Learning – 34 / 47
BATCH NORMALIZATION
Batch Normalization (BatchNorm) is an extremely popular
technique that improves the training speed and stability of deep
neural nets.
It is an extra component that can be placed between each layer of
the neural network.
It works by changing the "distribution" of activations at each hidden
layer of the network.
We know that it is sometimes beneficial to normalize the inputs to
a learning algorithm by shifting and scaling all the features so that
they have 0 mean and unit variance.
BatchNorm applies a similar transformation to the activations of
the hidden layers (with a couple of additional tricks).

Deep Learning – 35 / 47
BATCH NORMALIZATION
For a hidden layer with neurons zj , j = 1, . . . , J, BatchNorm is
applied to each zj by considering the activations of zj over a given
minibatch of inputs.
Let zj⁽ⁱ⁾ denote the activation of zj for input x⁽ⁱ⁾ in the minibatch (of size m).
The mean and variance of the activations are

µj = (1/m) Σᵢ zj⁽ⁱ⁾
σj² = (1/m) Σᵢ (zj⁽ⁱ⁾ − µj)²

Each zj⁽ⁱ⁾ is then normalized:

z̃j⁽ⁱ⁾ = (zj⁽ⁱ⁾ − µj) / √(σj² + ϵ)

where a small constant, ϵ, is added for numerical stability.


Deep Learning – 36 / 47
BATCH NORMALIZATION
It may not be desirable to normalize the activations in such a rigid
way because potentially useful information can be lost in the
process.
Therefore, we commonly let the training algorithm decide the "right amount" of normalization by allowing it to re-shift and re-scale z̃j⁽ⁱ⁾ to arrive at the batch-normalized activation ẑj⁽ⁱ⁾:

ẑj⁽ⁱ⁾ = γj z̃j⁽ⁱ⁾ + βj

γj and βj are learnable parameters that are also tweaked by backpropagation.
ẑj⁽ⁱ⁾ then becomes the input to the next layer.
Note: The algorithm is free to scale and shift each z̃j⁽ⁱ⁾ back to its original (unnormalized) value.
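A minimal NumPy sketch of this forward computation for one layer during training; Z is assumed to hold the m × J activations of the minibatch, and gamma and beta are the learnable vectors:

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each neuron over the minibatch, then re-scale and re-shift."""
    mu = Z.mean(axis=0)                      # per-neuron batch mean
    var = Z.var(axis=0)                      # per-neuron batch variance
    Z_tilde = (Z - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * Z_tilde + beta            # batch-normalized activations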

Deep Learning – 37 / 47
BATCH NORMALIZATION: ILLUSTRATION
Recall: zj = σ(Wjᵀ x + bj)
So far, we have applied BatchNorm to the activation zj. It is possible (and more common) to apply BatchNorm to Wjᵀ x + bj before passing it to the nonlinear activation σ.

Figure: FC = Fully Connected layer. BatchNorm is applied before the nonlinear activation function.

Deep Learning – 38 / 47
BATCH NORMALIZATION
The key impact of BatchNorm on the training process is this: It
reparametrizes the underlying optimization problem to make its
landscape significantly more smooth.
One aspect of this is that the loss changes at a smaller rate and
the magnitudes of the gradients are also smaller (see Santurkar et
al. 2018).

Deep Learning – 39 / 47
BATCH NORMALIZATION: PREDICTION
Once the network has been trained, how can we generate a
prediction for a single input (either at test time or in production)?
One option is to feed the entire training set to the (trained) network
and compute the means and standard deviations.
More commonly, during training, an exponentially weighted
running average of each of these statistics over the minibatches is
maintained.
The learned γ and β parameters are then used (in conjunction
with the running averages) to generate the output.
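A sketch of the running-average approach; the momentum value 0.9 is an illustrative choice:

import numpy as np

def update_running_stats(run_mu, run_var, mu, var, momentum=0.9):
    """Exponentially weighted running averages of the minibatch statistics (kept during training)."""
    return (momentum * run_mu + (1 - momentum) * mu,
            momentum * run_var + (1 - momentum) * var)

def batchnorm_predict(z, run_mu, run_var, gamma, beta, eps=1e-5):
    """At test time, normalize an input with the stored running statistics."""
    return gamma * (z - run_mu) / np.sqrt(run_var + eps) + beta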

Deep Learning – 40 / 47
BATCH NORMALIZATION
For our final benchmark in this chapter we train two models on the MNIST data.
One will extend our basic architecture such that we add batch
normalization to all hidden layers.
We use SGD as optimizer with a momentum of 0.9, a learning rate
of 0.03 and weight decay of 0.001.

Deep Learning – 41 / 47
BATCH NORMALIZATION

Batch Normalization accelerated learning.

Deep Learning – 42 / 47
BATCH NORMALIZATION

Batch Normalization resulted in a lower test error.

Deep Learning – 43 / 47
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014)
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
https://arxiv.org/abs/1406.2572
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein (2017)
Visualizing the Loss Landscape of Neural Nets
https://arxiv.org/abs/1712.09913
Tim Dettmers (2015)
Deep Learning in a Nutshell: History and Training
https://devblogs.nvidia.com/deep-learning-nutshell-history-training/

Deep Learning – 44 / 47
REFERENCES
Hafidz Zulkifli (2018)
Understanding Learning Rates and How It Improves Performance in Deep Learning
https://towardsdatascience.com
Ilya Loshchilov, Frank Hutter (2016)
SGDR: Stochastic Gradient Descent with Warm Restarts
https://arxiv.org/abs/1608.03983
Jeremy Jordan (2018)
Setting the learning rate of your neural network
https://www.jeremyjordan.me/nn-learning-rate/
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry (2018)
How Does Batch Normalization Help Optimization?
https://arxiv.org/abs/1805.11604

Deep Learning – 45 / 47
REFERENCES
Akshay Chandra (2015)
Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent
https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12

Deep Learning – 46 / 47
Deep Learning

Modern Activation Functions

Learning goals
Challenges in Optimization
related to Activation Functions
Activations for Hidden Units
Activations for Output Units
Hidden activations

Deep Learning – 1 / 17
HIDDEN ACTIVATIONS
Recall, hidden-layer activation functions make it possible for deep
neural nets to learn complex non-linear functions.
The design of hidden units is an extremely active area of research.
It is usually not possible to predict in advance which activation will
work best. Therefore, the design process often consists of trial and
error.
In the following, we will limit ourselves to the most popular
activations - Sigmoidal activation and ReLU.
It is possible for many other functions to perform as well as these
standard ones. An overview of further activations can be found
here .

Deep Learning – 2 / 17
SIGMOIDAL ACTIVATIONS
Sigmoidal functions such as tanh and the logistic sigmoid bound
the outputs to a certain range by "squashing" their inputs.

In each case, the function is only sensitive to its inputs in a small neighborhood around 0.
Furthermore, the derivative is never greater than 1 and is close to
zero across much of the domain.

Deep Learning – 3 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
1 Saturating Neurons:
We know: σ′(z_in) → 0 for |z_in| → ∞.
→ Neurons with sigmoidal activations "saturate" easily, that is, they stop being responsive when |z_in| ≫ 0.

Deep Learning – 4 / 17
SIGMOIDAL ACTIVATION FUNCTIONS
2 Vanishing Gradients: Consider the vector of error signals δ⁽ⁱ⁾ in layer i:

δ⁽ⁱ⁾ = W⁽ⁱ⁺¹⁾ δ⁽ⁱ⁺¹⁾ ⊙ σ′(z_in⁽ⁱ⁾),  i ∈ {1, ..., O}.

The k-th component of the vector expresses how much the loss L changes when the input to the k-th neuron z_{k,in}⁽ⁱ⁾ changes.
We know: σ′(z) < 1 for all z ∈ ℝ.
→ In each step of the recursive formula above, the value will be multiplied by a value smaller than one:

δ⁽¹⁾ = W⁽²⁾ δ⁽²⁾ ⊙ σ′(z_in⁽¹⁾)
     = W⁽²⁾ (W⁽³⁾ δ⁽³⁾ ⊙ σ′(z_in⁽²⁾)) ⊙ σ′(z_in⁽¹⁾)
     = ...

When this occurs, earlier layers train very slowly (or not at all).
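A tiny numeric illustration of the effect: the derivative of the logistic sigmoid is at most 0.25, so the product of many such factors shrinks towards zero (the 20 random pre-activations are an arbitrary illustrative choice).

import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25

rng = np.random.default_rng(0)
z = rng.normal(size=20)           # pre-activations of 20 stacked layers
print(np.prod(sigmoid_deriv(z)))  # product of derivative factors, practically zero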

Deep Learning – 5 / 17
RECTIFIED LINEAR UNITS (RELU)

The ReLU activation solves the vanishing gradient problem.

In regions where the activation is positive, the derivative is 1.


As a result, the derivatives do not vanish along paths that contain
such "active" neurons even if the network is deep.
Note that the ReLU is not differentiable at 0 (Software
implementations return either 0 or 1 for the derivative at this point).

Deep Learning – 6 / 17
RECTIFIED LINEAR UNITS (RELU)
ReLU units can significantly speed up training compared to units
with saturating activations.

Source : Krizhevsky et al. (2012)

Figure: A four-layer convolutional neural network with ReLUs (solid line) reaches a
25% training error rate on the CIFAR-10 dataset six times faster than an equivalent
network with tanh neurons (dashed line).

Deep Learning – 7 / 17
RECTIFIED LINEAR UNITS (RELU)
A downside of ReLU units is that when the input to the activation is
negative, the derivative is zero. This is known as the "dying ReLU
problem".

When a ReLU unit "dies", that is, when its activation is 0 for all
datapoints, it kills the gradient flowing through it during
backpropagation.
This means such units are never updated during training and the
problem can be irreversible.
Deep Learning – 8 / 17
GENERALIZATIONS OF RELU
There exist several generalizations of the ReLU activation that
have non-zero derivatives throughout their domains.
Leaky ReLU:

LReLU(v) = v    if v ≥ 0
LReLU(v) = αv   if v < 0

Unlike the ReLU, when the input to the Leaky ReLU activation is
negative, the derivative is α which is a small positive value (such
as 0.01).
Deep Learning – 9 / 17
GENERALIZATIONS OF RELU
A variant of the Leaky ReLU is the Parametric ReLU (PReLU)
which learns the α from the data through backpropagation.
Exponential Linear Unit (ELU):

ELU(v) = v            if v ≥ 0
ELU(v) = α(eᵛ − 1)    if v < 0

Scaled Exponential Linear Unit (SELU):

SELU(v) = λ v             if v ≥ 0
SELU(v) = λ α(eᵛ − 1)     if v < 0
Note: In ELU and SELU, α and λ are hyperparameters that are set before training.

These generalizations may perform as well as or better than the ReLU on some tasks.
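A minimal NumPy sketch of these activations; for SELU, α ≈ 1.6733 and λ ≈ 1.0507 are the constants commonly used for self-normalization:

import numpy as np

def leaky_relu(v, alpha=0.01):
    return np.where(v >= 0, v, alpha * v)

def elu(v, alpha=1.0):
    return np.where(v >= 0, v, alpha * (np.exp(v) - 1))

def selu(v, alpha=1.6733, lam=1.0507):
    return lam * np.where(v >= 0, v, alpha * (np.exp(v) - 1))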

Deep Learning – 10 / 17
GENERALIZATIONS OF RELU

Source : Zhang et al. 2018

Deep Learning – 11 / 17
Output activations

Deep Learning – 12 / 17
OUTPUT ACTIVATIONS
As we have seen previously, the role of the output activation is to
get the final score on the same scale as the target.
The output activations and the loss functions used to train neural
networks can be viewed through the lens of maximum likelihood
estimation (MLE).
In general, the function f (x | θ) represented by the neural network
defines the conditional p(y | x, θ) in a supervised learning task.
Maximizing the likelihood is then equivalent to minimizing
− log p(y | x, θ).
An output unit with the identity function as the activation can be
used to represent the mean of a Gaussian distribution.
For such a unit, training with mean-squared error is equivalent to
maximizing the log-likelihood (ignoring issues with non-convexity).

Deep Learning – 13 / 17
OUTPUT ACTIVATIONS
Similarly, sigmoid and softmax units can output the parameter(s) of
a Bernoulli distribution and Categorical distribution, respectively.
It is straightforward to show that when the label is one-hot
encoded, training with the cross-entropy loss is equivalent to
maximizing log-likelihood. Click here
Because these activations can saturate, an important advantage of
maximizing log-likelihood is that the log undoes some of the
exponentiation in the activation functions which is desirable when
optimizing with gradient-based methods.
For example, in the case of softmax, the loss is:

L(y, f(x)) = −f_in,k + log Σ_{k′=1}^{g} exp(f_in,k′)

where k is the correct class. The first term, −fin,k , does not
saturate which means training can progress steadily even if the
contribution of fin,k to the second term is negligible.
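A small sketch of this loss for a single score vector f_in and correct class index k, using the usual max-shift so that the log-sum-exp does not overflow:

import numpy as np

def softmax_cross_entropy(f_in, k):
    """Loss -f_in[k] + log(sum_j exp(f_in[j])) for the correct class k."""
    shifted = f_in - np.max(f_in)   # subtracting the max does not change the loss value
    return -shifted[k] + np.log(np.sum(np.exp(shifted)))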
Deep Learning – 14 / 17
OUTPUT ACTIVATIONS
A neural network can even be used to output the parameters of
more complex distributions.
A Mixture Density Network, for example, outputs the parameters of
a Gaussian Mixture Model:
p(y | x) = Σ_{c=1}^{m} φ⁽ᶜ⁾(x) N(y; µ⁽ᶜ⁾(x), Σ⁽ᶜ⁾(x))

where m is the number of components in the mixture.
In such a network, the output units are divided into groups.
One group of output neurons with softmax activation represents
the weights (φ(c ) ) of the mixture.
Another group with the identity activation represents the means
(µ(c ) ) and yet another group with a non-negative activation
function (such as ReLU or the exponential function) can represent
the variances of the (typically) diagonal covariance matrices Σ(c ) .
Deep Learning – 15 / 17
OUTPUT ACTIVATIONS

Source : Goodfellow et al. (2016)

Figure: Samples drawn from a Mixture Density Network. The input x is sampled from
a uniform distribution and y is sampled from p(y | x, θ).

Deep Learning – 16 / 17
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton (2012)
ImageNet Classification with Deep Convolutional Neural Networks
https: // papers. nips. cc/ paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks
pdf
Guoqiang Zhang and Haopeng Li (2018)
Effectiveness of Scaled Exponentially-Regularized Linear Units
https: // arxiv. org/ abs/ 1807. 10117

Deep Learning – 17 / 17
Deep Learning

Network Initializations

Learning goals
Why Initialization matters
Weight Initializations
Bias Initialization
PRACTICAL INITIALIZATION
The weights (and biases) of a neural network must be assigned
some initial values before training can begin.
The choice of the initial weights (and biases) is crucial as it
determines whether an optimization algorithm converges, how fast
and whether to a point with high or low risk.
Initialization strategies to achieve "nice" properties are difficult to
find, because there is no good understanding which properties are
preserved under which circumstances.
In the following we distinguish between the initialization of weights and biases.

Deep Learning – 1 / 10
WEIGHT INITIALIZATION
It is important to initialize the weights randomly in order to "break
symmetry". If two neurons (with the same activation function in a
fully connected network) are connected to the same inputs and
have the same initial weights, then both neurons will have the
same gradient update in a given iteration and they will end up
learning the same features.
Furthermore, the initial weights should not be too large, because
this might result in an explosion of weights or high sensitivity to
changes in the input.
Weights are typically drawn from a uniform distribution or a
Gaussian centered at 0 with a small variance.
Centering the initial weights around 0 can be seen as a form of regularization: it imposes a prior that units are more likely not to interact with each other than to interact.

Deep Learning – 2 / 10
WEIGHT INITIALIZATION
Two common initialization strategies for weights are the ’Glorot
initialization’ and ’He initialization’ which tune the variance of these
distributions based on the topology of the network.
Glorot initialization suggests sampling each weight of a fully connected layer with m inputs and n outputs from a uniform distribution

w_{j,k} ∼ U( −√(6 / (m + n)), √(6 / (m + n)) )

The strategy is derived from the assumption that the network consists only of a chain of matrix multiplications with no nonlinearities.

Deep Learning – 3 / 10
WEIGHT INITIALIZATION
He initialization is especially useful for neural networks with ReLU activations. Each weight of a fully connected layer with m inputs is sampled from a Gaussian distribution

w_{j,k} ∼ N(0, 2/m)

The underlying derivation can be found in He et al. (2015).
Since the initialization strategies of Glorot and He depend on the
layer sizes, the initial weights for large layer sizes can become
extremely small.
Another strategy is to treat the weights as hyperparameters that
can be optimized by hyperparameter search algorithms. This can
be computationally costly.
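A minimal NumPy sketch of the two sampling schemes for a fully connected layer with m inputs and n outputs:

import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(m, n):
    """Glorot initialization: uniform with limit sqrt(6 / (m + n))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def he_normal(m, n):
    """He initialization: Gaussian with variance 2 / m (suited to ReLU layers)."""
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / m), size=(m, n))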

Deep Learning – 4 / 10
WEIGHT INITIALIZATION: EXAMPLE
We use a spiral planar data set to compare the following strategies:
Zero initialization, random initialization (samples from N(0, 1 · 10⁻⁴)) and He initialization.
For each strategy, a neural network with one hidden layer with 100
units, ReLU activation and Gradient Descent as optimizer was
used.

Credit : Ghatak, ch. 4

Figure: Simulated spiral planar data set with two classes.

Deep Learning – 5 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with zero initialization on the training data set (left)
and the testing data set (right). The zero initialization does not break symmetry
and the complexity of the network reduces to that of a single neuron.

Deep Learning – 6 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with random initialization (N(0, 1 · 10⁻⁴)) on the training data set (left) and the testing data set (right).

Deep Learning – 7 / 10
WEIGHT INITIALIZATION: EXAMPLE

Credit: Ghatak (2019), ch. 4

Figure: Decision boundary with He initialization on the training data set (left)
and the testing data set (right).

Deep Learning – 8 / 10
BIAS INITIALIZATION
Typically, we set the biases for each unit to heuristically chosen
constants.
Setting the biases to zero is compatible with most weight
initialization schemes as the schemes expect a small bias.
However, deviations from 0 can be made individually, for example,
in order to obtain the right marginal statistics of the output unit or
to avoid causing too much saturation at the initialization.
For details see Goodfellow et al. (2016).

Deep Learning – 9 / 10
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 1026-1034.
https://arxiv.org/abs/1502.01852
Xavier Glorot and Yoshua Bengio (2010)
Understanding the difficulty of training deep feedforward neural networks
AISTATS, Volume 9 of JMLR Proceedings, pages 249-256. JMLR.org
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
Abhijit Ghatak (2019)
Deep Learning with R. Springer.

Deep Learning – 10 / 10
Deep Learning

CNN: Introduction
Learning goals
What are CNNs?
When to apply CNNs?
A glimpse into CNN architectures
CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNN, or ConvNet) are a powerful family of neural networks that are inspired by biological processes in which the connectivity pattern between neurons resembles the organization of the mammalian visual cortex.

Figure: The ventral (recognition) pathway in the visual cortex has multiple stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT etc., which consist of lots of intermediate representations.

Deep Learning – 1 / 11
CONVOLUTIONAL NEURAL NETWORKS
Since 2012, given their success in the ILSVRC competition, CNNs
are popular in many fields.
Common applications of CNN-based architectures in computer
vision are:
Image classification.
Object detection / localization.
Semantic segmentation.
CNNs are widely applied in other domains such as natural
language processing (NLP), audio, and time-series data.
Basic idea: a CNN automatically extracts visual, or, more generally, spatial features from the input data such that it is able to make the optimal prediction based on the extracted features.
It contains different building blocks and components.

Deep Learning – 2 / 11
CNNS - WHAT FOR?

Figure: All Tesla cars being produced now have full self-driving hardware
(Source: Tesla website). A convolutional neural network is used to map raw
pixels from a single front-facing camera directly into steering commands. The
system learns to drive in traffic, on local roads, with or without lane markings
as well as on highways.

Deep Learning – 3 / 11
CNNS - WHAT FOR?

Figure: Given an input image, a CNN is first used to get the feature map of the
last convolutional layer, then a pyramid parsing module is applied to harvest
different sub-region representations, followed by upsampling and
concatenation layers to form the final feature representation, which carries
both local and global context information. Finally, the representation is fed into
a convolution layer to get the final per-pixel prediction. (Source: pyramid scene
parsing network, by Zhao et. al, CVPR 2017)

Deep Learning – 4 / 11
CNNS - WHAT FOR?

Figure: Road segmentation (Mnih Volodymyr (2013)). Aerial images and


possibly outdated map pixels are segmented.

Deep Learning – 5 / 11
CNNS - WHAT FOR?
CNN for personalized medicine. Examples:
Tracking, diagnosis and localization of Covid-19 patients.
A CNN-based method (RADLogists) for personalized Covid-19 detection: three CT scans from a single coronavirus patient diagnosed by RADLogists.

Deep Learning – 6 / 11
CNNS - WHAT FOR?

Figure: Four COVID-19 lung CT scans at the top with corresponding colored
maps showing Corona virus abnormalities at the bottom (Source: Megan
Scudellari, IEEE Spectrum 2021).

Deep Learning – 7 / 11
CNNS - WHAT FOR?

Figure: Various analyses in computational pathology are possible. For example, nuclear segmentation in digital microscopic tissue images enables extraction of high-quality features for nuclear morphometrics (Source: Kummar et al., IEEE Transactions on Medical Imaging).

Deep Learning – 8 / 11
CNNS - WHAT FOR?

Figure: Image Colorization is another interesting application of CNN in


computer vision (Zhang et al. (2016)). Given a grayscale photo as the input
(top row), this network solves the problem of hallucinating a plausible color
version of the photo (bottom row, i.e. the prediction of the network).

Deep Learning – 9 / 11
CNNS - WHAT FOR?

Figure: Speech recognition (Anand & Verma (2015)). Convolutional neural


network is used to learn features from the audio data in order to classify
emotions.

Deep Learning – 10 / 11
CNNS - A FIRST GLIMPSE

Input layer takes input data (e.g. image, audio).


Convolution layers extract feature maps from the previous layers.
Pooling layers reduce the dimensionality of feature maps and
filter meaningful features.

Deep Learning – 11 / 11
CNNS - A FIRST GLIMPSE

Fully connected layers connect feature map elements to the


output neurons.
Softmax converts output values to probability scores.

Deep Learning – 11 / 11
Deep Learning

Convolutional Operation

Learning goals
What are filters?
Convolutional Operation
2D Convolution
FILTERS TO EXTRACT FEATURES
Filters have been widely applied in Computer Vision (CV) since the ’70s.
One prominent example: Sobel-Filter.
It detects edges in images.

Figure: Sobel-filtered image.

Deep Learning – 1 / 9
FILTERS TO EXTRACT FEATURES
Edges occur where the intensity over neighboring pixels changes
fast.
Thus, we approximate the gradient of the intensity at each pixel.
Sobel showed that the gradient image Gx of original image A in
x-dimension can be approximated by:

     [ −1  0  +1 ]
Gx = [ −2  0  +2 ] ∗ A = Sx ∗ A
     [ −1  0  +1 ]
where ∗ indicates a mathematical operation known as a
convolution, not a traditional matrix multiplication.
The filter matrix Sx consists of the product of an averaging and a
differentiation kernel:

Sx = [1 2 1]ᵀ · [−1 0 +1]
     (averaging)  (differentiation)

Deep Learning – 2 / 9
FILTERS TO EXTRACT FEATURES
Similarly, the gradient image Gy in y-dimension can be
approximated by:

     [ −1  −2  −1 ]
Gy = [  0   0   0 ] ∗ A = Sy ∗ A
     [ +1  +2  +1 ]

The combination of both gradient images yields a dimension-independent gradient information G:

G = √(Gx² + Gy²)

These matrix operations were used to create the filtered picture of Albert Einstein.
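A small sketch of the operation, assuming SciPy is available; since we only use the gradient magnitude, the kernel flip performed by a true convolution does not matter here:

import numpy as np
from scipy.signal import convolve2d

Sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
Sy = Sx.T

def sobel_magnitude(A):
    """Approximate gradient magnitude of a grayscale image A."""
    Gx = convolve2d(A, Sx, mode="same")
    Gy = convolve2d(A, Sy, mode="same")
    return np.sqrt(Gx ** 2 + Gy ** 2)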

Deep Learning – 3 / 9
HORIZONTAL VS VERTICAL EDGES

Source: Wikipedia

Figure: Sobel filtered images. Outputs are normalized in each case.

Deep Learning – 4 / 9
FILTERS TO EXTRACT FEATURES

Let’s do this on a dummy image.


How to represent a digital image?

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Basically as an array of integers.

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Sx enables us to detect vertical edges!

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

(Gx)_(i,j) = (I ∗ Sx)_(i,j) = −1·0 + 0·255 + 1·255
                              −2·0 + 0·0   + 2·255
                              −1·0 + 0·255 + 1·255
                            = 1020

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Applying the Sobel-Operator to every location in the input yields


the feature map.

Deep Learning – 5 / 9
FILTERS TO EXTRACT FEATURES

Normalized feature map reveals vertical edges.


Note the dimensional reduction compared to the dummy image.

Deep Learning – 5 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?
What we just did was extracting pre-defined features from our
input (i.e. edges).
A convolutional neural network does almost exactly the same:
“extracting features from the input”.
⇒ The main difference is that we usually do not tell the CNN what to look for (pre-define the filters); the CNN decides itself.
In a nutshell:
We initialize a lot of random filters (like the Sobel but just
random entries) and apply them to our input.
Then, a classifier (e.g. a feed-forward neural net) uses the resulting feature maps as input data.
Filter entries will be adjusted by common gradient descent
methods.

Deep Learning – 6 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?

Deep Learning – 7 / 9
WHY DO WE NEED TO KNOW ALL OF THAT?

Deep Learning – 7 / 9
WORKING WITH IMAGES
In order to understand the functionality of CNNs, we have to
familiarize ourselves with some properties of images.
Grey scale images:
Matrix with dimensions height × width × 1.
Pixel entries differ from 0 (black) to 255 (white).
Color images:
Tensor with dimensions height × width × 3.
The depth 3 denotes the RGB values (red - green - blue).
Filters:
A filter’s depth is always equal to the input’s depth!
In practice, filters are usually square.
Thus we only need one integer to define its size.
For example, a filter of size 2 applied on a color image
actually has the dimensions 2 × 2 × 3.

Deep Learning – 8 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

To obtain s11 we simply compute the dot product:


s11 = a · w11 + b · w12 + d · w21 + e · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

Same for s12 :


s12 = b · w11 + c · w12 + e · w21 + f · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

As well as for s21 :


s21 = d · w11 + e · w12 + g · w21 + h · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

And finally for s22 :


s22 = e · w11 + f · w12 + h · w21 + i · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

s11 = a · w11 + b · w12 + d · w21 + e · w22


s12 = b · w11 + c · w12 + e · w21 + f · w22
s21 = d · w11 + e · w12 + g · w21 + h · w22
s22 = e · w11 + f · w12 + h · w21 + i · w22

Deep Learning – 9 / 9
THE 2D CONVOLUTION
Suppose we have an input with entries a, b, . . . , i (think of pixel
values).
The filter we would like to apply has weights w11 , w12 , w21 and w22 .

More generally, let I be the matrix representing the input and W be the filter/kernel. Then the entries of the output matrix are defined by

s_ij = Σ_{m,n} I_{i+m−1, j+n−1} · w_{mn}

where the sum runs over the row index m and the column index n of the kernel W.
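A naive NumPy sketch of this operation for a single-channel input with a "valid" output size (this is the cross-correlation that CNN frameworks actually compute):

import numpy as np

def conv2d(I, W):
    """s_ij = sum_{m,n} I[i+m, j+n] * W[m, n] (0-based), no padding, stride 1."""
    k1, k2 = W.shape
    h_out, w_out = I.shape[0] - k1 + 1, I.shape[1] - k2 + 1
    S = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            S[i, j] = np.sum(I[i:i + k1, j:j + k2] * W)  # dot product at this spatial location
    return S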

Deep Learning – 9 / 9
Deep Learning

Properties of Convolution

Learning goals
Sparse Interactions
Parameter Sharing
Equivariance to Translation
SPARSE INTERACTIONS

We want to use the “neuron-wise” representation of our CNN.


Moving the filter to the first spatial location yields the first entry of
the feature map which is composed of these four connections.
Deep Learning – 1 / 7
SPARSE INTERACTIONS

Similarly...

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Similarly...

Deep Learning – 1 / 7
SPARSE INTERACTIONS

and finally s22 by these and in total, we obtain 16 connections!

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Assume we would replicate the architecture with a dense net.

Deep Learning – 1 / 7
SPARSE INTERACTIONS

Each input neuron is connected with each hidden layer neuron.

Deep Learning – 1 / 7
SPARSE INTERACTIONS

In total, we obtain 36 connections!

Deep Learning – 1 / 7
SPARSE INTERACTIONS
What does that mean?
Our CNN has a receptive field of 4 neurons.
That means, we apply a “local search” for features.
A dense net on the other hand conducts a “global search”.
The receptive field of the dense net are 9 neurons.
When processing images, it is more likely that features occur at
specific locations in the input space.
For example, it is more likely to find the eyes of a human in a
certain area, like the face.
A CNN only incorporates the surrounding area of the filter into
its feature extraction process.
The dense architecture on the other hand assumes that every
single pixel entry has an influence on the eye, even pixels far
away or in the background.

Deep Learning – 2 / 7
PARAMETER SHARING

For the next property we focus on the filter entries.

Deep Learning – 3 / 7
PARAMETER SHARING

In particular, we consider weight w11

Deep Learning – 3 / 7
PARAMETER SHARING

As we move the filter to the first spatial location..

Deep Learning – 3 / 7
PARAMETER SHARING

...we observe the following connection for weight w11

Deep Learning – 3 / 7
PARAMETER SHARING

Moving to the next location...

Deep Learning – 3 / 7
PARAMETER SHARING

...highlights that we use the same weight more than once!

Deep Learning – 3 / 7
PARAMETER SHARING

Even three...

Deep Learning – 3 / 7
PARAMETER SHARING

And in total four times.

Deep Learning – 3 / 7
PARAMETER SHARING

All together, we have just used four weights.

Deep Learning – 3 / 7
PARAMETER SHARING

How many weights does a corresponding dense net use?

Deep Learning – 3 / 7
PARAMETER SHARING

9 · 4 = 36! That is 9 times more weights!

Deep Learning – 3 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Why is that good?
Less parameters drastically reduce memory requirements.
Faster runtime:
For m inputs and n outputs, a fully connected layer requires m · n parameters and has O(m · n) runtime.
A convolutional layer has limited connections k ≪ m, thus only k · n parameters and O(k · n) runtime.
Less parameters mean less overfitting and better generalization!

Deep Learning – 4 / 7
SPARSE CONNECTIONS AND PARAMETER
SHARING
Example: consider a color image with size 100 × 100.
Suppose we would like to create one single feature map with a "same padding" (i.e. the hidden layer is of the same size).
Choosing a filter with size 5 means that we have a total of 5 · 5 · 3 = 75 parameters (bias unconsidered).
A dense net with the same amount of "neurons" in the hidden layer results in

(100² · 3) · (100²) = 300,000,000
 (input)     (hidden layer)

parameters.
Note that this was just a fictitious example. In practice we normally
do not try to replicate CNN architectures with dense networks.

Deep Learning – 5 / 7
EQUIVARIANCE TO TRANSLATION

Think of a specific feature of interest, here highlighted in grey.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

Furthermore, assume we had a tuned filter looking for exactly that


feature.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

The filter does not care at what location the feature of interest is
located at.

Deep Learning – 6 / 7
EQUIVARIANCE TO TRANSLATION

It is literally able to find it anywhere! That property is called


equivariance to translation.
Note: A function f (x ) is equivariant to a function g if f (g (x )) = g (f (x )).

Deep Learning – 6 / 7
NONLINEARITY IN FEATURE MAPS
As in dense nets, we use activation functions on all feature map
entries to introduce nonlinearity in the net.
Typically rectified linear units (ReLU) are used in CNNs:
They reduce the danger of saturating gradients compared to
sigmoid activations.
They can lead to sparse activations, as neuron outputs ≤ 0 are squashed to 0, which increases computational speed.
As seen in the last chapter, many variants of ReLU (Leaky ReLU,
ELU, PReLU, etc.) exist.

Deep Learning – 7 / 7
Deep Learning

CNN Components

Learning goals
Input Channel
Padding
Stride
Pooling
INPUT CHANNEL

Figure: source: Chaitanya Belwal’s Blog

An image consists of the smallest indivisible segments called


pixels and every pixel has a strength often known as the pixel
intensity.
A grayscale image has a single input channel and the value of each pixel represents the amount of light.
Note that a grayscale value can lie between 0 and 255, where a value of 0 corresponds to black and 255 to white.

Deep Learning – 1 / 14
Figure: Image source: Computer Vision Primer: How AI Sees An Image (Kishan Maladkar’s Blog)

A colored digital image usually comes with three color channels,


i.e. the Red-Green-Blue channels, popularly known as the RGB
values.
Each pixel can be represented by a vector of three numbers (each
ranging from 0 to 255) for the three primary color channels.

Deep Learning – 2 / 14
VALID PADDING
Suppose we have an input of size 5 × 5 and a filter of size 2 × 2.

Deep Learning – 3 / 14
VALID PADDING
The filter is only allowed to move inside of the input space.

Deep Learning – 3 / 14
VALID PADDING
That will inevitably reduce the output dimensions.

In general, for an input of size i (× i) and filter size k (× k), the size of the output feature map o (× o) is calculated by:

o = i − k + 1

Deep Learning – 3 / 14
SAME PADDING
Suppose the following situation: an input with dimensions 5 × 5 and a filter with size 3 × 3.

Deep Learning – 4 / 14
SAME PADDING
We would like to obtain an output with the same dimensions as the
input.

Deep Learning – 4 / 14
SAME PADDING
Hence, we apply a technique called zero padding. That is to say
“pad” zeros around the input:

Deep Learning – 4 / 14
SAME PADDING
That always works! We just have to adjust the zeros according to the input dimensions and filter size (i.e. one, two or more rows).

Deep Learning – 4 / 14
PADDING AND NETWORK DEPTH

Figure: “Valid” versus “same” convolution. Top : Without padding, the width of
the feature map shrinks rapidly to 1 after just three convolutional layers (filter
width of 6 shown in each layer). This limits how deep the network can be
made. Bottom : With zero padding (shown as solid circles), the feature map
can remain the same size after each convolution which means the network
can be made arbitrarily deep. (Goodfellow, et al., 2016, ch. 9)
Deep Learning – 5 / 14
Strides

Deep Learning – 6 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

Deep Learning – 7 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

Deep Learning – 7 / 14
STRIDES
Stepsize “strides” of our filter (stride = 2 shown below).

In general, when there is no padding, for an input of size i, filter size k and stride s, the size o of the output feature map is:

o = ⌊(i − k) / s⌋ + 1

Deep Learning – 7 / 14
STRIDES AND DOWNSAMPLING

Figure: A strided convolution is equivalent to a convolution without strides


followed by downsampling (Goodfellow, et al., 2016, ch. 9).

Deep Learning – 8 / 14
MAX POOLING

We’ve seen how convolutions work, but there is one other


operation we need to understand.
We want to downsample the feature map but optimally lose no
information.

Deep Learning – 9 / 14
MAX POOLING

Applying the max pooling operation, we simply look for the


maximum value at each spatial location.
That is 8 for the first location.
Due to the filter of size 2, we halve the dimensions of the original feature map and obtain downsampling.

Deep Learning – 9 / 14
MAX POOLING

The final pooled feature map has entries 8, 6, 9 and 3.


Max pooling brings us 2 properties: 1) dimension reduction and 2) spatial invariance.
Popular pooling functions: max and (weighted) average.

Deep Learning – 9 / 14
AVERAGE POOLING

We've seen how max pooling works; there exist other pooling operations such as avg pooling, fractional pooling, LP pooling, softmax pooling, stochastic pooling, blur pooling, global average pooling, etc.
Similar to max pooling, we downsample the feature map but optimally lose no information.

Deep Learning – 10 / 14
AVERAGE POOLING

Applying the average pooling operation, we simply look for the


mean/average value at each spatial location.

Deep Learning – 10 / 14
AVERAGE POOLING

Average pooling uses all of the information (via the sum) and backpropagates to all responses.
It is not robust to noise.

Deep Learning – 10 / 14
AVERAGE POOLING

The final pooled feature map has entries 3.75, 2.5, 4.25 and 1.75.

Deep Learning – 10 / 14
COMPARISON OF MAX AND AVERAGE POOLING
Avg pooling uses all the information (via the sum), while max pooling uses only the highest value.
Max pooling removes details, so it is suitable for sparse information (image classification), while avg pooling is suitable for dense information (NLP).

Figure: Shortcomings of max and average pooling using Toy Image (photo
source: https://iq.opengenus.org/maxpool-vs-avgpool/)

Deep Learning – 11 / 14
Figure: CNNs use colored images where each of the Red, Green and Blue (RGB) color spectrums serve as input. (source:
Chaitanya Belwal’s Blog)

In this CNN:
there are 3 input channels, each a 4×4 input matrix,
one 2×2 filter (also known as a kernel),
a single ReLU layer,
a single pooling layer (which applies the MaxPool function),

Deep Learning – 12 / 14
and a single fully connected (FC) layer.
The elements of the filter matrix are equivalent to the unit weights
in a standard NN and will be updated during the backpropagation
phase.
Assuming a stride of 2 with no padding, the output size of the convolution layer is determined by the following equation:

O = (I − K + 2·P) / S + 1, where:
O: is the dimension (rows and columns) of the output square
matrix,
I: is the dimension (rows and columns) of the input square
matrix,
K: is the dimension (rows and columns) of the filter (kernel)
square matrix,
P: is the number of pixels(cells) of padding added to each
side of the input,

Deep Learning – 13 / 14
S: is the stride, or the number of cells skipped each time the
kernel is slided.

Inserting the values shown in the figure into the equation,

O = (I − K + 2·P) / S + 1 = (4 − 2 + 2·0) / 2 + 1 = 2
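The same computation as a small helper function (integer division implements the usual flooring):

def conv_output_size(i, k, p=0, s=1):
    """Output size O = (I - K + 2P) / S + 1 of a square convolution layer."""
    return (i - k + 2 * p) // s + 1

print(conv_output_size(i=4, k=2, p=0, s=2))   # 2, as in the example above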

Deep Learning – 14 / 14
Introduction to Deep Learning

CNN Applications

Learning goals
Application of CNNs in Visual
Recognition
APPLICATION - IMAGE CLASSIFICATION
One use case of CNNs is image classification.
There exists a broad variety of network architectures for image
classification such as the LeNet, AlexNet, InceptionNet and
ResNet which will be discussed later in detail.
All these architectures rely on a set of sequences of convolutional
layers and aim to learn the mapping from an image to a probability
score over a set of classes.

WS 2021/2022 Introduction to Deep Learning – 1 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: Image classification with Cifar-10: famous benchmark dataset with


60000 images and 10 classes (Alex Krizhevsky (2009)). There is also a much
more difficult version with 60000 images and 100 classes.

WS 2021/2022 Introduction to Deep Learning – 2 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: One example of the Cifar-10 data: A highly pixelated, coloured image
of a frog with dimension [3, 32, 32].

WS 2021/2022 Introduction to Deep Learning – 3 / 18


APPLICATION - IMAGE CLASSIFICATION

Figure: An example of a CNN architecture for classification on the Cifar-10


dataset (FC = Fully Connected).

WS 2021/2022 Introduction to Deep Learning – 4 / 18


CNN VS A FULLY CONNECTED NET ON CIFAR-10

Figure: Performance of a CNN and a fully connected neural net ("Dense") on


Cifar-10. Both networks have roughly the same number of layers. They were
trained using the same learning rate, weight decay and dropout rate. The CNN
performs better as it has the right inductive bias for this task.

WS 2021/2022 Introduction to Deep Learning – 5 / 18


APPLICATION - BEYOND IMAGE CLASSIFICATION
There are many visual recognition problems that are related to image
classification, such as:
object detection
image captioning
semantic segmentation
visual question answering
visual instruction navigation
scene graph generation

WS 2021/2022 Introduction to Deep Learning – 6 / 18


APPLICATION - IMAGE COLORIZATION
Basic idea (introduced by Zhang et al., 2016):
Train the net on pairs of grayscale and colored images.
Force it to make a prediction on the color-value for each
pixel in the grayscale input image.
Combine the grayscale-input with the color-output to yield a
colorized image.
Very comprehensive material on the method is provided on the
author’s website. Click here

WS 2021/2022 Introduction to Deep Learning – 7 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The CNN learns the mapping from grayscale (L) to color (ab) for each
pixel in the image. The L and ab maps are then concatenated to yield the
colorized image. The authors use the LAB color space for the image
representation.

WS 2021/2022 Introduction to Deep Learning – 8 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The colour space (ab) is quantized in a total of 313 bins. This allows
to treat the color prediction as a classification problem where each pixel is
assigned a probability distribution over the 313 bins and the one with the
highest softmax score is taken as predicted color value. The bin is then
mapped back to the corresponding numeric (a,b) values. The network is
optimized using a multinomial cross-entropy loss over the 313 quantized (a,b)
bins.

WS 2021/2022 Introduction to Deep Learning – 9 / 18


APPLICATION - IMAGE COLORIZATION

Figure: The architecture consists of stacked CNN layers which are upsampled
towards the end of the net. It makes use of dilated convolutions and
upsampling layers which will be explained later. The output is a tensor of
dimension [64, 64, 313] that stores the 313 probabilities for each element of
the final, downsampled 64x64 feature maps.

WS 2021/2022 Introduction to Deep Learning – 10 / 18


APPLICATION - IMAGE COLORIZATION

Figure: This block is then upsampled to a dimension of 224x224 and the


predicted color bins are mapped back to the (a,b) values yielding a depth of 2.
Finally, the L and the ab maps are concatenated to yield a colored image.

WS 2021/2022 Introduction to Deep Learning – 11 / 18


APPLICATION - OBJECT LOCALIZATION
Until now, we used CNNs for single-class classification of images -
which object is on the image?
Now we extend this framework - is there an object in the image
and if yes, where and which?

Figure: Classify and detect the location of the cat.

WS 2021/2022 Introduction to Deep Learning – 12 / 18


APPLICATION - OBJECT LOCALIZATION
Bounding boxes can be defined by the location of the left lower
corner as well as the height and width of the box: [bx , by , bh , bw ].
We now combine three tasks (detection, classification and
localization) in one architecture.
This can be done by adjusting the label output of the net.
Imagine a task with three classes (cat, car, frog).
In standard classification we would have:

label vector (c_cat, c_car, c_frog)ᵀ and softmax output (P(y = cat | X, θ), P(y = car | X, θ), P(y = frog | X, θ))ᵀ

WS 2021/2022 Introduction to Deep Learning – 13 / 18


APPLICATION - OBJECT LOCALIZATION
We include the information, if there is a object as well as the
bounding box parametrization in the label vector.
This gives us the following label vector:

y = (b_x, b_y, b_h, b_w, c_o, c_cat, c_car, c_frog)ᵀ, where

b_x:    x coordinate of the box
b_y:    y coordinate of the box
b_h:    height of the box
b_w:    width of the box
c_o:    presence of an object (binary)
c_cat:  class cat, one-hot
c_car:  class car, one-hot
c_frog: class frog, one-hot

WS 2021/2022 Introduction to Deep Learning – 14 / 18


APPLICATION - OBJECT LOCALIZATION

Naive approach: use a CNN with two heads, one for the class
classification and one for the bounding box regression.
But: What happens, if there are two cats in the image?
Different approaches: "Region-based" CNNs (R-CNN, Fast R-CNN
and Faster R-CNN) and "single-shot" CNNs (SSD and YOLO).

WS 2021/2022 Introduction to Deep Learning – 15 / 18


SEMANTIC SEGMENTATION
The goal of semantic image segmentation is to label each pixel of an
image with a corresponding class of what is being represented.

Figure:

U-Net, Ronneberger et al. MICCAI 2015

WS 2021/2022 Introduction to Deep Learning – 16 / 18


IMAGE CAPTIONING
The goal of image captioning is to convert a given input image into a
natural language description.

Figure: "a group of people riding on top of an elephant": Caption generation with augmented visual attention by Biswas et al., 2020

WS 2021/2022 Introduction to Deep Learning – 17 / 18


VISUAL QUESTION ANSWERING
Visual Question Answering is a research area about building a
computer system to answer questions presented in an image and a
natural language.

Figure: Combining CNN/RNN for VQA (source:


https://towardsdatascience.com/)

WS 2021/2022 Introduction to Deep Learning – 18 / 18


Deep Learning

1D / 2D / 3D Convolutions

Learning goals
1D Convolutions
2D Convolutions
3D Convolutions
1D Convolutions

Deep Learning – 1 / 22
1D CONVOLUTIONS
Data situation: Sequential, 1-dimensional tensor data.
Data consists of tensors with shape [depth, xdim]
Depth 1 (single-channel):
Univariate time series, e.g. development of a single stock
price over time
Functional / curve data
Depth > 1 (mutli-channel):
Multivariate time series, e.g.
Movement data measured with multiple sensors for
human activity recognition
Temperature and humidity in weather forecasting
Text encoded as character-level one-hot-vectors
→ Convolve the data with a 1D-kernel

Deep Learning – 2 / 22
1D CONVOLUTIONS – OPERATION

Figure: Illustration of 1D movement data with depth 1 and filter size 1.

Deep Learning – 3 / 22
1D CONVOLUTIONS – OPERATION

Figure: Illustration of 1D movement data with depth 1 and filter size 2.

Deep Learning – 4 / 22
1D CONVOLUTIONS – SENSOR DATA

Figure: Illustration of 1D movement data with depth 3 measured with an


accelerometer sensor belonging to a human activity recognition task.

Deep Learning – 5 / 22
1D CONVOLUTIONS – SENSOR DATA

Figure: Time series classification with 1D CNNs and global average pooling
(explained later). An input time series is convolved with 3 CNN layers, pooled
and fed into a fully connected layer before the final softmax layer. This is one
of the classic time series classification architectures.

Deep Learning – 6 / 22
1D CONVOLUTIONS – TEXT MINING
1D convolutions also have an interesting application in text mining.
For example, they can be used to classify the sentiment of text
snippets such as yelp reviews.

Figure: Sentiment classification: can we teach the net that this a positive
review?

Deep Learning – 7 / 22
1D CONVOLUTIONS – TEXT MINING

We use a given alphabet to encode the text reviews (here:


“dummy review” ).
Each character is transformed into a one-hot vector. The vector for
character d contains only 0’s at all positions except for the 4th
position.
The maximum length of each review is set to 1014: shorter texts
are padded with spaces (zero-vectors), longer texts are simply cut.

Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The data is represented as 1D signal with depth = size of the


alphabet .
Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The temporal dimension is shown as the y dimension for


illustrative purposes.
Deep Learning – 8 / 22
1D CONVOLUTIONS – TEXT MINING

The 1D-kernel (blue) convolves the input in the temporal


y-dimension yielding a 1D feature vector.
Deep Learning – 8 / 22
ADVANTAGES OF 1D CONVOLUTIONS
For certain applications 1D CNNs are advantageous and thus
preferable to their 2D counterparts:
Computational complexity: Forward propagation and backward
propagation in 1D CNNs require simple array operations.
Training is easier: Recent studies show that 1D CNNs with
relatively shallow architectures are able to learn challenging tasks
involving 1D signals.
Hardware: Usually, training deep 2D CNNs requires special
hardware setup (e.g. Cloud computing). However, any CPU
implementation over a standard computer is feasible and relatively
fast for training compact 1D CNNs.
Application: Due to their low computational requirements, compact
1D CNNs are well-suited for real-time and low-cost applications
especially on mobile or hand-held devices.

Deep Learning – 9 / 22
2D Convolutions

Deep Learning – 10 / 22
2D CONVOLUTIONS
The basic idea behind a 2D convolution is sliding a small window
(called a "kernel/filter") over a larger 2D array, and performing a dot
product between the filter elements and the corresponding input array
elements at every position.

Figure: Here’s a diagram demonstrating the application of a 2×2 convolution filter to a 5×5 array, in 16 different positions.

Deep Learning – 11 / 22
2D CONVOLUTIONS – EXAMPLE

In Deep Learning, convolution is the element-wise multiplication


and addition.
For an image with 1 channel, the convolution is demonstrated in
the figure below. Here the filter is a 2×2 matrix with elements [[0, 1], [2, 2]].

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

The filter is sliding through the input.


We move/convolve filter on input neurons to create a feature map.

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

Notice that stride is 1 and padding is 0 in this example.

Deep Learning – 12 / 22
2D CONVOLUTIONS – EXAMPLE

Each sliding position ends up with one number. The final output is
then a 2 × 2 matrix.
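A small NumPy sketch of this sliding-window computation; the 3 × 3 input values are made up, the 2 × 2 kernel is the [[0, 1], [2, 2]] filter from the example:

import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep learning calls 'convolution'):
    stride 1, no padding."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiplication and addition at each position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])          # made-up 3x3 input
kernel = np.array([[0, 1],
                   [2, 2]])            # filter from the example
print(conv2d(image, kernel))           # [[20. 25.] [35. 40.]], a 2x2 feature map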

Deep Learning – 12 / 22
3D Convolutions

Deep Learning – 13 / 22
3D CONVOLUTIONS
Data situation: 3-dimensional tensor data.
Data consists of tensors with shape [depth, xdim, ydim, zdim].
Dimensions can be both temporal (e.g. video frames) or spatial
(e.g. MRI)
Examples:
Human activity recognition in video data
Disease classification or tumor segmentation on MRI scans
Solution: Move a 3D-kernel in x, y and z direction to capture all
important information.

Deep Learning – 14 / 22
3D CONVOLUTIONS – DATA

Figure: Illustration of depth 1 volumetric data: MRI scan. Each slice of the stack has depth 1, as the frames are black-and-white (grayscale).

Deep Learning – 15 / 22
3D CONVOLUTIONS – DATA

Figure: Illustration of volumetric data with depth > 1: video snippet of an action detection task. The video consists of several slices, stacked in temporal order. Frames have depth 3, as they are RGB.

Deep Learning – 16 / 22
3D CONVOLUTIONS

Note: 3D convolutions yield a 3D output.

Deep Learning – 17 / 22
3D CONVOLUTIONS

Figure: Basic 3D-CNN architecture.

Basic architecture of the CNN stays the same.


3D convolutions output 3D feature maps which are element-wise
activated and then (eventually) pooled in 3 dimensions.

Deep Learning – 18 / 22
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 19 / 22
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 20 / 22
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 21 / 22
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 22 / 22
Deep Learning

Important Types of Convolutions

Learning goals
Dilated Convolutions
Transposed Convolutions
Dilated Convolutions

Deep Learning – 1 / 26
DILATED CONVOLUTIONS
Idea : artificially increase the receptive field of the net without
using more filter weights.
The receptive field of a single neuron comprises all inputs that
have an impact on this neuron.
Neurons in the first layers capture less information of the input,
while neurons in the last layers have huge receptive fields and can
capture a lot more global information from the input.
The size of the receptive fields depends on the filter size.

Figure: Receptive field of each convolution layer with a 3 × 3 kernel. The orange area marks the receptive field of one pixel in layer 2, the yellow area marks the receptive field of one pixel in layer 3.

Deep Learning – 2 / 26
DILATED CONVOLUTIONS
Intuitively, neurons in the first layers capture less information of the
input (layer), while neurons in the last layers have huge receptive
fields and can capture a lot more global information from the input
(layer).
The size of the receptive fields depends on the filter size.

Figure: A convolutional neural network with 3 layers of 3 × 3 kernels. The orange area marks the receptive field of one neuron in layer 2 w.r.t. the input layer (size 9), the yellow area marks the receptive field of one pixel in layer 3.

Deep Learning – 3 / 26
DILATED CONVOLUTIONS
By increasing the filter size, the size of the receptive fields
increases as well and more contextual information can be
captured.
However, increasing the filter size increases the number of
parameters, which leads to increased runtime.
Artificially increase the receptive field of the net without using more
filter weights by adding a new dilation parameter to the kernel that
skips pixels during convolution.
Benefits:
Capture more contextual information.
Enable the processing of inputs in higher dimensions to
detect fine details.
Improved run-time performance due to fewer parameters (see the sketch below).
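A minimal PyTorch sketch of the dilation parameter (input size and channel counts are arbitrary): a dilated 3 × 3 kernel covers a 5 × 5 area with the same 9 weights.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                            # dummy single-channel image

standard = nn.Conv2d(1, 1, kernel_size=3, dilation=1)    # covers a 3x3 area
dilated  = nn.Conv2d(1, 1, kernel_size=3, dilation=2)    # covers a 5x5 area
                                                          # with the same 9 weights
print(sum(p.numel() for p in standard.parameters()))     # 10 (9 weights + 1 bias)
print(sum(p.numel() for p in dilated.parameters()))      # 10 as well
print(standard(x).shape)    # torch.Size([1, 1, 30, 30])
print(dilated(x).shape)     # torch.Size([1, 1, 28, 28]); no padding is used here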

Deep Learning – 4 / 26
DILATED CONVOLUTIONS
Useful in applications where the global context is of great
importance for the model decision.
This component finds application in:
Generation of audio signals and music within the famous WaveNet model developed by DeepMind.
Time series classification and forecasting.
Image segmentation.

Figure: Dilated convolution on 2D data. A dilated kernel is a regular convolutional kernel interleaved with zeros.

Deep Learning – 5 / 26
DILATED CONVOLUTIONS

Figure: Simple 1D convolutional network with convolutional kernel of size 2, stride 2 and fixed weights {0.5, 1.0}. The kernel is not dilated (dilation factor 1). One neuron in layer 2 has a receptive field of size 4 w.r.t. the input layer.

Deep Learning – 6 / 26
DILATED CONVOLUTIONS

Figure: Simple 1D convolutional network with convolutional kernel of size 2, stride 2 and fixed weights {0.5, 1.0}. The kernel is dilated with dilation factor 2. One neuron in layer 2 has a receptive field of size 7 w.r.t. the input layer.

Deep Learning – 7 / 26
DILATED CONVOLUTIONS

Figure: Application of (a variant of) dilated convolutions on time series for classification or seq2seq prediction (e.g. machine translation). Given an input sequence x_0, x_1, ..., x_T, the model generates an output sequence ŷ_0, ŷ_1, ..., ŷ_T. Dilation factors d = 1, 2, 4 are shown above, each with a kernel size k = 3. The dilations are used to drastically increase the context information for each output neuron with relatively few layers.

Deep Learning – 8 / 26
DILATED CONVOLUTIONS

Deep Learning – 9 / 26
Transposed Convolutions

Deep Learning – 10 / 26
TRANSPOSED CONVOLUTIONS
Problem setting:
For many applications and in many network architectures, we
often want to do transformations going in the opposite
direction of a normal convolution, i.e. we would like to perform
up-sampling.
Examples include generating high-resolution images and mapping a low-dimensional feature map to a high-dimensional space, e.g. in auto-encoders or semantic segmentation.
Instead of decreasing dimensionality as with regular convolutions,
transposed convolutions are used to re-increase dimensionality
back to the initial dimensionality.
Note: Do not confuse this with deconvolutions (which are
mathematically defined as the inverse of a convolution).

Deep Learning – 11 / 26
TRANSPOSED CONVOLUTIONS
Example 1:
Input: yellow feature map with dim 4 × 4.
Output: blue feature map with dim 2 × 2.

Figure: A regular convolution with kernel size k = 3, padding p = 0 and stride s = 1.

Here, the feature map shrinks from 4 × 4 to 2 × 2.

Deep Learning – 12 / 26
TRANSPOSED CONVOLUTIONS
One way to upsample is to use a regular convolution with various
padding strategies.

Figure: A transposed convolution can be seen as a regular convolution. The convolution (above) with k′ = 3, s′ = 1, p′ = 2 re-increases dimensionality from 2 × 2 to 4 × 4.

Deep Learning – 13 / 26
TRANSPOSED CONVOLUTIONS
For a convolution with kernel size k, stride s = 1 and padding p,
the associated transposed convolution has parameters k′ = k, s′ = s and p′ = k − p − 1 (here p = 0, hence p′ = k − 1).

Deep Learning – 14 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Convolution as a matrix multiplication :

credit:Stanford University

Figure: A "regular" 1D convolution. stride = 1 , padding = 1. The vector a is


the 1D input feature map.

Deep Learning – 15 / 26
TRANSPOSED CONVOLUTIONS
Example 2 : Transposed Convolution as a matrix multiplication :

credit:Stanford University

Figure: "Transposed" convolution upsamples a vector of length 4 to a vector of


length 6. Stride is 1. Note the change in padding.

Important : Even though the "structure" of the matrix here is the transpose of
the original matrix, the non-zero elements are, in general, different from the
correponding elements in the original matrix. These (non-zero)
elements/weights are tuned by backpropagation.

Deep Learning – 16 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Convolution as matrix multiplication:

Figure: A regular 1D convolution with stride = 1 and padding = 0. The vector z is the input feature map. The matrix K represents the convolution operation.

A regular convolution decreases the dimensionality of the feature map from 6 to 4.

Deep Learning – 17 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:

Figure: A transposed convolution can be used to upsample the feature vector of length 4 back to a feature vector of length 6.

Note:
Even though the transpose of the original matrix is shown in this
example, the actual values of the weights are different from the
original matrix (and optimized by backpropagation).
The goal of the transposed convolution here is simply to get back
the original dimensionality. It is not necessarily to get back the
original feature map itself.
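A small NumPy sketch of this matrix view (the kernel values and input are made up): the convolution matrix K maps a length-6 vector to length 4, and multiplying by K⊤ maps it back to length 6.

import numpy as np

k = np.array([1., 2., 3.])          # illustrative 1D kernel of size 3
n_in, n_out = 6, 4                  # conv maps length 6 -> 4 (stride 1, no padding)

# Build the convolution matrix K (4 x 6): each row holds the kernel,
# shifted by one position.
K = np.zeros((n_out, n_in))
for i in range(n_out):
    K[i, i:i + len(k)] = k

z = np.arange(1., 7.)               # input feature map of length 6
down = K @ z                        # regular convolution: length 4
up = K.T @ down                     # "transposed" convolution: back to length 6
print(down.shape, up.shape)         # (4,) (6,)

# In a network, the non-zero entries of K.T would be free weights learned by
# backpropagation, not literally the transposed values of K.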

Deep Learning – 18 / 26
TRANSPOSED CONVOLUTIONS
Example 3: Transposed Convolution as matrix multiplication:

Figure: A transposed convolution can be used to upsample the feature vector of length 4 back to a feature vector of length 6.
Note:
The elements in the downsampled vector only affect those
elements in the upsampled vector that they were originally
"derived" from. For example, z7 was computed using z1 , z2 and z3
and it is only used to compute z̃1 , z̃2 and z̃3 .
In general, transposing the original matrix does not result in a
convolution. But a transposed convolution can always be
implemented as a regular convolution by using various padding
strategies (this would not be very efficient, however).
Deep Learning – 18 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.

credit: Stanford University

Figure: Regular 3 × 3 convolution, stride 2, padding 1.

Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS
Example 4: Let us now view transposed convolutions from a different
perspective.

credit: Stanford University

Figure: Transposed 3 × 3 convolution, stride 2, padding 1. Note: stride now refers to the "stride" in the output.

Here, the filter is scaled by the input.

Deep Learning – 19 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK

Figure: Artifacts produced by transposed convolutions.

Transposed convolutions lead to checkerboard-style artifacts in the resulting images.

Deep Learning – 20 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Explanation: transposed convolution yields an overlap in some feature
map values.
This leads to higher magnitude for some feature map elements than for
others, resulting in the checkerboard pattern.
One solution is to ensure that the kernel size is divisible by the stride.

Figure: 1D example. In both images, top row = input and bottom row = output. Top:
Here, kernel weights overlap unevenly which results in a checkerboard pattern. Bottom:
There is no checkerboard pattern as the kernel size is divisible by the stride.

Deep Learning – 21 / 26
TRANSPOSED CONVOLUTIONS – DRAWBACK
Solutions:
Increase dimensionality via upsampling (bilinear, nearest
neighbor) and then convolve this output with regular
convolution.
Make sure that the kernel size k is divisible by the stride s.

Figure: Nearest neighbor upsampling and a subsequent same-padding convolution to avoid checkerboard patterns.
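A minimal PyTorch sketch of this upsample-then-convolve strategy (channel count and feature-map size are arbitrary):

import torch
import torch.nn as nn

# Upsample first (nearest neighbor), then apply a regular "same" convolution.
up_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),   # keeps height and width
)

x = torch.randn(1, 16, 8, 8)
print(up_block(x).shape)    # torch.Size([1, 16, 16, 16])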

Deep Learning – 22 / 26
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 23 / 26
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 24 / 26
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 25 / 26
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 26 / 26
Deep Learning

Separable Convolutions and Flattening

Learning goals
Separable Convolutions
Flattening
Separable Convolutions

Deep Learning – 1 / 17
SEPARABLE CONVOLUTIONS
Separable Convolutions are used in some neural net architectures,
such as the MobileNet.
Motivation: make convolution computationally more efficient.
One can perform:
spatially separable convolution
depthwise separable convolution.
The spatially separable convolution operates on the 2D spatial
dimensions of images, i.e. height and width. Conceptually, spatially
separable convolution decomposes a convolution into two separate
operations.
Consider the Sobel kernel from the previous lecture:

G_x = \begin{pmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{pmatrix}

Deep Learning – 2 / 17
SEPARABLE CONVOLUTIONS
This 3×3 kernel can be replaced by the outer product of a 3×1 and a 1×3 kernel:

G_x = \begin{pmatrix} +1 \\ +2 \\ +1 \end{pmatrix} \begin{pmatrix} +1 & 0 & -1 \end{pmatrix}

Convolving with both filters sequentially has the same effect, reduces the number of parameters to be stored and thus improves speed (see the sketch below):
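A small NumPy check of this decomposition:

import numpy as np

col = np.array([[1.], [2.], [1.]])          # 3x1 kernel
row = np.array([[1., 0., -1.]])             # 1x3 kernel

G_x = col @ row                             # outer product
print(G_x)
# [[ 1.  0. -1.]
#  [ 2.  0. -2.]
#  [ 1.  0. -1.]]

# Convolving an image with `col` and then with `row` gives the same result as a
# single convolution with G_x, but needs 3 + 3 = 6 weights instead of 9.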

Deep Learning – 3 / 17
SEPARABLE CONVOLUTIONS

Figure: In convolution, the 3x3 kernel directly convolves with the image. In
spatially separable convolution, the 3x1 kernel first convolves with the image.
Then the 1x3 kernel is applied. This would require 6 instead of 9 parameters
while doing the same operations.

Deep Learning – 4 / 17
SPATIALLY SEPARABLE CONVOLUTION
Example 1: A convolution on a 5 × 5 image with a 3 × 3 kernel
(stride=1, padding=0) requires scanning the kernel at 3 positions
horizontally and 3 vertically. That is 9 positions in total, indicated as the
dots in the image below. At each position, 9 element-wise
multiplications are applied. Overall, that is 9 x 9 = 81 multiplications.

Figure: Standard convolution with 1 channel.

Deep Learning – 5 / 17
SPATIALLY SEPARABLE CONVOLUTION

Figure: Spatially separable convolution with 1 channel. Overall, the spatially separable convolution takes 45 + 27 = 72 multiplications. (Image source: Bai (2019))

Note: Despite these advantages, spatially separable convolutions are rarely used in deep learning. The main reason is that not every kernel can be decomposed into two smaller ones. Replacing all standard convolutions by spatially separable ones would also restrict the set of kernels that can be learned during training, typically leading to worse results.

Deep Learning – 6 / 17
DEPTHWISE SEPARABLE CONVOLUTION
Depthwise separable convolutions are much more commonly used in deep learning (e.g. in MobileNet and Xception).
This convolution separates the convolutional process into two stages: a depthwise and a pointwise convolution.

Figure: Comparison between a standard convolution and a depthwise separable convolution.

Deep Learning – 7 / 17
DEPTHWISE SEPARABLE CONVOLUTION

Figure: Comparison of the number of multiplications in a depthwise separable convolution and a standard convolution.

Fewer computations therefore lead to a faster network.

Deep Learning – 8 / 17
DEPTHWISE SEPARABLE CONVOLUTION

Figure: Comparison of the number of multiplications in a depthwise separable convolution and a standard convolution.

Deep Learning – 9 / 17
DEPTHWISE CONVOLUTION
As the name suggests, we apply the kernels along the depth of the input volume (i.e. on the input channels). The steps followed in this convolution are:
Take as many kernels as there are input channels, each kernel having depth 1. For example, for a 3 × 3 kernel and a 6 × 6 input with 16 channels, there are 16 kernels of size 3 × 3 × 1.
Every channel thus has one kernel associated with it. Each kernel is convolved over its associated channel separately, resulting in 16 feature maps.
Stack all these feature maps to get the output volume with 4 × 4 output size and 16 channels.

Deep Learning – 10 / 17
POINTWISE CONVOLUTION
As the name suggests, this type of convolution is applied to every single spatial position separately (remember 1 × 1 convolutions?). So how does this work?
Take a 1 × 1 convolution with as many filters as the number of output channels you want.
Apply this 1 × 1 convolution to the output of the depthwise convolution (see the sketch below).
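A minimal PyTorch sketch comparing the parameter counts (the channel counts are made up; batch norm and activations are omitted):

import torch.nn as nn

in_ch, out_ch = 16, 32

# Standard convolution: 32 * 16 * 3 * 3 = 4608 weights (+ 32 biases).
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: one 3x3 kernel per input channel (groups=in_ch),
# followed by a 1x1 (pointwise) convolution that mixes channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(n_params(standard))               # 4640
print(n_params(depthwise, pointwise))   # 160 + 544 = 704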

Deep Learning – 11 / 17
Flattening

Deep Learning – 12 / 17
FLATTENING
Flattening converts the data into a 1-dimensional array for inputting it to the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification model, a fully-connected layer.
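A minimal PyTorch sketch (the shapes are illustrative):

import torch
import torch.nn as nn

feature_maps = torch.randn(8, 64, 7, 7)        # (batch, channels, height, width)
flat = feature_maps.flatten(start_dim=1)       # -> (8, 64*7*7) = (8, 3136)
classifier = nn.Linear(64 * 7 * 7, 10)         # final fully-connected layer
print(classifier(flat).shape)                  # torch.Size([8, 10])
# Inside an nn.Sequential model, nn.Flatten() performs the same operation.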

Deep Learning – 13 / 17
REFERENCES
Dumoulin, Vincent and Visin, Francesco (2016)
A guide to convolution arithmetic for deep learning
https: // arxiv. org/ abs/ 1603. 07285v1
Van den Oord, Aaron, Sander Dielman, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, and Koray Kavukocuoglu (2016)
WaveNet: A Generative Model for Raw Audio
https: // arxiv. org/ abs/ 1609. 03499
Benoit A., Gennart, Bernard Krummenacher, Roger D. Hersch, Bernard Saugy,
J.C. Hadorn and D. Mueller (1996)
The Giga View Multiprocessor Multidisk Image Server
https: // www. researchgate. net/ publication/ 220060811_ The_ Giga_
View_ Multiprocessor_ Multidisk_ Image_ Server
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Paluri Manohar
(2015)
Learning Spatiotemporal Features with 3D Convolutional Networks
https: // arxiv. org/ pdf/ 1412. 0767. pdf

Deep Learning – 14 / 17
REFERENCES
Milletari, Fausto, Nassir Navab and Seyed-Ahmad Ahmadi (2016)
V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image
Segmentation
https: // arxiv. org/ pdf/ 1606. 04797. pdf
Zhang, Xiang, Junbo Zhao and Yann LeCun (2015)
Character-level Convolutional Networks for Text Classification
http: // arxiv. org/ abs/ 1509. 01626
Wang, Zhiguang, Weizhong Yan and Tim Oates (2017)
Time Series Classification from Scratch with Deep Neural Networks: A Strong
Baseline
http: // arxiv. org/ abs/ 1509. 01626
Fisher Yu and Vladlen Koltun (2015)
Multi-Scale Context Aggregation by Dilated Convolutions
https: // arxiv. org/ abs/ 1511. 07122

Deep Learning – 15 / 17
REFERENCES
Bai, Shaojie, Zico J. Kolter and Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
http: // arxiv. org/ abs/ 1509. 01626
Augustus Odena, Vincent Dumoulin and Chris Olah (2016)
Deconvolution and Checkerboard Artifacts
https: // distill. pub/ 2016/ deconv-checkerboard/
Andre Araujo, Wade Norris and Jack Sim (2019)
Computing Receptive Fields of Convolutional Neural Networks
https: // distill. pub/ 2019/ computing-receptive-fields/
Zhiguang Wang, Yan, Weizhong and Tim Oates (2017)
Time series classification from scratch with deep neural networks: A strong
baseline
https: // arxiv. org/ 1611. 06455

Deep Learning – 16 / 17
REFERENCES
Lin, Haoning and Shi, Zhenwei and Zou, Zhengxia (2017)
Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale
Fully Convolutional Network

Deep Learning – 17 / 17
Deep Learning

Modern Architectures - I
Learning goals
LeNet
AlexNet
VGG
Network in Network
LeNet

Deep Learning – 1 / 18
LENET ARCHITECTURE
Pioneering work on CNNs by Yann LeCun in 1998.
Applied on the MNIST dataset for automated handwritten digit
recognition.
Consists of convolutional, "subsampling" and dense layers.
Complexity and depth of the net was mainly restricted by limited
computational power back in the days.

Figure: LeNet architecture: two conv layers with subsampling, followed by dense layers and a ’Gaussian connections’ layer.

Deep Learning – 2 / 18
LENET ARCHITECTURE
A neuron in a subsampling layer looks at a 2 × 2 region of a feature map, sums the four values, multiplies the sum by a trainable coefficient, adds a trainable bias and then applies a sigmoid activation.
A stride of 2 ensures that the size of the feature map reduces by
about a half.
The ’Gaussian connections’ layer has a neuron for each possible
class.
The output of each neuron in this layer is the (squared) Euclidean
distance between the activations from the previous layer and the
weights of the neuron.

Deep Learning – 3 / 18
AlexNet

Deep Learning – 4 / 18
ALEXNET
AlexNet, which employed an 8-layer CNN, won the ImageNet
Large Scale Visual Recognition (LSVR) Challenge 2012 by a
phenomenally large margin.
The network was trained in parallel on two small GPUs, using two streams of convolutions which are partly interconnected.
The architectures of AlexNet and LeNet are very similar, but there
are also significant differences:
First, AlexNet is deeper than the comparatively small LeNet5.
AlexNet consists of eight layers: five convolutional layers, two
fully-connected hidden layers, and one fully-connected output
layer.
Second, AlexNet used the ReLU instead of the sigmoid as its
activation function.

Deep Learning – 5 / 18
ALEXNET

Figure: From LeNet (left) to AlexNet (right).

Deep Learning – 6 / 18
VGG

Deep Learning – 7 / 18
VGG BLOCKS
The block is composed of convolutions with 3 × 3 kernels with padding of 1 (keeping height and width) and 2 × 2 max pooling with stride of 2 (halving the resolution after each block).
The use of blocks leads to very compact representations of the
network definition.
It allows for efficient design of complex networks.

credit : D2DL

Figure: VGG block.
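A minimal PyTorch sketch of such a block, following the description above:

import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """A VGG block: `num_convs` 3x3 convolutions (padding 1, so height and
    width are kept), each followed by ReLU, then 2x2 max pooling with stride 2
    that halves the resolution."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(num_convs=2, in_channels=3, out_channels=64)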

Deep Learning – 8 / 18
VGG NETWORK
Architecture introduced by Simonyan and Zisserman, 2014 as
“Very Deep Convolutional Network”.
A deeper variant of the AlexNet.
The basic idea is to use small filters and deeper networks.
Mainly uses many conv layers with a small kernel size of 3 × 3.
A stack of three 3 × 3 conv layers (stride 1) has the same effective receptive field as one 7 × 7 conv layer.
Performed very well in the ImageNet Challenge in 2014.
Exists in a smaller version (VGG16) with a total of 16 layers (13 conv layers and 3 fc layers) and a larger version (VGG19) with 19 layers (16 conv layers and 3 fc layers), both built from 5 VGG blocks.

Deep Learning – 9 / 18
VGG NETWORK

credit : D2DL

Figure: From AlexNet to VGG that is designed from building blocks.

Deep Learning – 10 / 18
Network in Network (NiN)

Deep Learning – 11 / 18
NIN BLOCKS
The idea behind NiN is to apply a fully-connected layer at each
pixel location. If we tie the weights across each spatial location, we
could think of this as a 1 × 1 convolutional layer.
The NiN block consists of one convolutional layer followed by two
1 × 1 convolutional layers that act as per-pixel fully-connected
layers with ReLU activations.
The convolution window shape of the first layer is typically set by
the user. The subsequent window shapes are fixed to 1 × 1.

credit : D2DL

Figure: NiN block.
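A minimal PyTorch sketch of a NiN block as described above:

import torch.nn as nn

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    """One convolution with a user-chosen window, followed by two 1x1
    convolutions acting as per-pixel fully-connected layers with ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
    )

block = nin_block(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=0)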

Deep Learning – 12 / 18
GLOBAL AVERAGE POOLING
Goal: tackle overfitting in the final fully connected layer.
The elements of the final feature maps are connected to the
output layer via a dense layer. This could require a huge
number of weights increasing the danger of overfitting.
Example: 1024 feature maps of dim 7 × 7 connected to 10 output neurons lead to 1024 · 7² · 10 ≈ 500,000 weights for the final dense layer.
The larger the feature map, the more detrimental.
Classic pooling not ideal as it removes spatial information and
is mainly used for dimension and parameter reduction.

Deep Learning – 13 / 18
GLOBAL AVERAGE POOLING
Solution:
Average each final feature map into a single element of a global average pooling (GAP) vector.
Example: the 1024 feature maps are now reduced to a GAP vector of length 1024, yielding a final dense layer with only 1024 · 10 weights (see the sketch below).
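A minimal PyTorch sketch contrasting the two heads for the 1024 × 7 × 7 example (the batch size is arbitrary):

import torch
import torch.nn as nn

feature_maps = torch.randn(8, 1024, 7, 7)       # (batch, channels, height, width)

# Dense head on the flattened maps: 1024 * 7 * 7 * 10 weights (+ 10 biases).
dense_head = nn.Linear(1024 * 7 * 7, 10)

# GAP head: average each 7x7 map to one value, then only 1024 * 10 weights.
gap = feature_maps.mean(dim=(2, 3))             # (8, 1024)
gap_head = nn.Linear(1024, 10)

print(sum(p.numel() for p in dense_head.parameters()))   # 501770
print(sum(p.numel() for p in gap_head.parameters()))     # 10250
print(gap_head(gap).shape)                                # torch.Size([8, 10])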

Figure: An Example of Fully Connected Layer VS Global Average Pooling


Layer

Deep Learning – 14 / 18
GLOBAL AVERAGE POOLING
GAP condenses each feature map into a single value, preserving its overall activation while drastically decreasing the dimension.
Mitigates the possibly destructive effect of pooling.
Each element of the GAP output represents the activation of a
certain feature on the input data.
Acts as an additional regularizer on the final fully connected layer.

Deep Learning – 15 / 18
NETWORK IN NETWORK (NIN)

NiN uses blocks consisting of a convolutional layer and multiple


1 × 1 convolutional layers. This can be used within the
convolutional stack to allow for more per-pixel nonlinearity.
NiN removes the fully-connected layers and replaces them with
global average pooling (i.e., summing over all locations) after
reducing the number of channels to the desired number of outputs
(e.g., 10 for Fashion-MNIST).
Removing the fully-connected layers reduces overfitting. NiN has
dramatically fewer parameters.
The NiN design influenced many subsequent CNN designs.

Deep Learning – 16 / 18
NETWORK IN NETWORK (NIN)

credit : D2DL

Figure: Comparing architectures of VGG and NiN, and their blocks.

Deep Learning – 17 / 18
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf

Deep Learning – 18 / 18
Deep Learning

Modern Architectures - II

Learning goals
GoogLeNet
ResNet
DenseNet
U-Net
GoogLeNet

Deep Learning – 1 / 24
INCEPTION MODULES
The Inception block is equivalent to a subnetwork with four paths.
It extracts information in parallel through convolutional layers of
different window shapes and max-pooling layers.
1 × 1 convolutions reduce channel dimensionality on a per-pixel
level. Max-pooling reduces the resolution.

Figure: Inception Block.

Deep Learning – 2 / 24
GOOGLENET ARCHITECTURE
GoogLeNet connects multiple well-designed Inception blocks with
other layers in series.
The ratio of the number of channels assigned in the Inception
block is obtained through a large number of experiments on the
ImageNet dataset.
GoogLeNet, as well as its succeeding versions, was one of the
most efficient models on ImageNet, providing similar test accuracy
with lower computational complexity.

Deep Learning – 3 / 24
GOOGLENET ARCHITECTURE

credit : D2DL

Figure: The GoogLeNet architecture.

Deep Learning – 4 / 24
Residual Networks (ResNet)

Deep Learning – 5 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Problem setting: theoretically, we could build infinitely deep
architectures as the net should learn to pick the beneficial layers and
skip those that do not improve the performance automatically.

credit : D2DL

Figure: For non-nested function classes, a larger function class does not
guarantee to get closer to the “truth” function (F∗). This does not happen in
nested function classes.

Deep Learning – 6 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
But: this skipping would imply learning an identity mapping
x = F(x). It is very hard for a neural net to learn such a 1:1
mapping through the many non-linear activations in the
architecture.
Solution: offer the model explicitly the opportunity to skip certain
layers if they are not useful.
Introduced in He et al., 2015, and motivated by the observation that stacking ever more layers increases the test error as well as the train error (≠ overfitting).

Deep Learning – 7 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)

credit : D2DL

Figure: A regular block (left) and a residual block (right).

Deep Learning – 8 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)

credit : D2DL

Figure: ResNet block with and without 1 × 1 convolution. The information flows through two layers and the identity function. Both streams of information are then element-wise summed and jointly activated.

Deep Learning – 9 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
Let H(x) be the optimal underlying mapping that should be
learned by (parts of) the net.
x is the input in layer l (can be raw data input or the output of a
previous layer).
H(x) is the output from layer l.
Instead of fitting H(x), the net ought to learn the residual mapping F(x) := H(x) − x, while x is added via the identity mapping.
Thus, H(x) = F(x) + x, as formulated on the previous slide.
The model should only learn the residual mapping F(x)
Thus, the procedure is also referred to as Residual Learning.

Deep Learning – 10 / 24
RESIDUAL BLOCK (SKIP CONNECTIONS)
The element-wise addition of the learned residuals F(x) and the
identity-mapped data x requires both to have the same
dimensions.
To allow for downsampling within F(x) (via pooling or valid-padded
convolutions), the authors introduce a linear projection layer Ws .
Ws ensures that x is brought to the same dimensionality as F(x)
such that:
y = F(x) + Ws x,
y is the output of the skip module and Ws represents the weight
matrix of the linear projection (# rows of Ws = dimensionality of
F(x)).
This idea applies to fully connected layers as well as to
convolutional layers.
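A minimal PyTorch sketch of a residual block with an optional 1 × 1 projection (batch normalization, which the original ResNet uses, is omitted for brevity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = F(x) + W_s x. The 1x1 convolution plays the role of the linear
    projection W_s and is only needed when F(x) changes the shape of x."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))          # F(x)
        identity = x if self.proj is None else self.proj(x)   # x or W_s x
        return F.relu(residual + identity)                    # element-wise sum

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 128, 16, 16])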

Deep Learning – 11 / 24
RESNET ARCHITECTURE
The residual mapping can learn the identity function more easily,
such as pushing parameters in the weight layer to zero.
We can train an effective deep neural network by having residual
blocks.
Inputs can forward propagate faster through the residual
connections across layers.
ResNet had a major influence on the design of subsequent deep
neural networks, both for convolutional and sequential nature.

Deep Learning – 12 / 24
RESNET ARCHITECTURE

credit : D2DL

Figure: The ResNet-18 architecture.

Deep Learning – 13 / 24
Densely Connected Networks (DenseNet)

Deep Learning – 14 / 24
FROM RESNET TO DENSENET
ResNet significantly changed the view of how to parametrize the
functions in deep networks.
DenseNet (dense convolutional network) is to some extent the
logical extension of this [Huang et al., 2017].
Dense blocks where each layer is connected to every other layer in
feedforward fashion.
Alleviates vanishing gradient, strengthens feature propagation,
encourages feature reuse.
To understand how to arrive at it, let us take a small detour to
mathematics:
Recall the Taylor expansion for functions. For the point x = 0
it can be written as:
f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \ldots

Deep Learning – 15 / 24
FROM RESNET TO DENSENET
The key point is that it decomposes a function into
increasingly higher order terms. In a similar vein, ResNet
decomposes functions into : f (x) = x + g (x).
That is, ResNet decomposes f into a simple linear term and a
more complex nonlinear one. What if we want to capture (not
necessarily add) information beyond two terms? One solution
was DenseNet [Huang et al., 2017].

credit : D2DL

Figure: DenseNet block.

Deep Learning – 16 / 24
FROM RESNET TO DENSENET
As shown in the previous figure, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated (denoted by [, ]) rather than added. As a result, we perform a mapping from x to its values after applying an increasingly complex sequence of functions:
x → [x, f1 (x), f2 ([x, f1 (x)]), f3 ([x, f1 (x), f2 ([x, f1 (x)])]), . . .] .
In the end, all these functions are combined in MLP to reduce the number of
features again. In terms of implementation this is quite simple: rather than
adding terms, we concatenate them.
The name DenseNet arises from the fact that the dependency graph between
variables becomes quite dense. The last layer of such a chain is densely
connected to all previous layers.
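A minimal PyTorch sketch of a dense block (channel counts and growth rate are made up; the original DenseNet additionally uses batch norm and 1 × 1 bottlenecks):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of the block input and all
    previous layer outputs along the channel axis."""
    def __init__(self, num_layers, in_ch, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1),
                nn.ReLU())
            for i in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # concatenate, do not add
        return x

block = DenseBlock(num_layers=3, in_ch=16, growth_rate=8)
print(block(torch.randn(1, 16, 28, 28)).shape)    # torch.Size([1, 40, 28, 28])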

credit : D2DL

Figure: The DenseNet architecture.

Deep Learning – 17 / 24
U-Net

Deep Learning – 18 / 24
U-NET
U-Net is a fully convolutional net that makes use of upsampling (via
transposed convolutions, for example) as well as skip connections.
Input images are getting convolved and down-sampled in the first
half of the architecture.
Then, they are getting upsampled and convolved again in the
second half to get back to the input dimension.
Skip connections throughout the net combine feature maps from
earlier layers with those from later layers by concatenating both
sets of maps along the depth/channel axis.
Only convolutional and no dense layers are used.

Deep Learning – 19 / 24
U-NET

Figure: Illustration of the architecture. Blue arrows are convolutions, red


arrows max-pooling operations, green arrows upsampling steps and the brown
arrows merge layers with skip connections. The height and width of the feature
blocks are shown on the vertical and the depth on the horizontal. D are
dropout layers.

Deep Learning – 20 / 24
U-NET
Example problem setting: train a neural net to pixelwise segment
roads in satellite imagery.
Answer the question: Where is the road map?

Figure: Model prediction on a test satellite image. Yellow are correctly


identified pixels, blue false negatives and red false positives.

Deep Learning – 21 / 24
U-NET
The net takes an RGB image [512, 512, 3] and outputs a binary
(road / no road) probability mask [512, 512, 1] for each pixel.
The model is trained via a binary cross entropy loss which was
combined over each pixel.

Figure: Scheme for the input/ output of the net architecture.

Deep Learning – 22 / 24
REFERENCES
B. Zhou, Khosla, A., Labedriza, A., Oliva, A. and A. Torralba (2016)
Deconvolution and Checkerboard Artifacts
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich (2014)
Going deeper with convolutions
https: // arxiv. org/ abs/ 1409. 4842
Kaiming He, Zhang, Xiangyu, Ren, Shaoqing, and Jian Sun (2015)
Deep Residual Learning for Image Recognition
https: // arxiv. org/ abs/ 1512. 03385
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba
(2016)
Learning Deep Features for Discriminative Localization
http: // cnnlocalization. csail. mit. edu/ Zhou_ Learning_ Deep_
Features_ CVPR_ 2016_ paper. pdf

Deep Learning – 23 / 24
REFERENCES
Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015)
U-Net: Convolutional Networks for Biomedical Image Segmentation
http: // arxiv. org/ abs/ 1505. 04597

Deep Learning – 24 / 24
Deep Learning

Recurrent Neural Networks - Introduction

Learning goals
Why do we need them?
How do they work?
Computational Graph of Recurrent Networks
Motivation

Deep Learning – 1 / 21
MOTIVATION FOR RECURRENT NETWORKS
The two types of neural network architectures that we’ve seen so
far are fully-connected networks and CNNs.
Their input layers have a fixed size and (typically) only handle
fixed-length inputs.
The primary reason: if we vary the size of the input layer, we would
also have to vary the number of learnable weights in the network.
This in particular relates to sequence data such as time-series,
audio and text.
Recurrent Neural Networks (RNNs) is a class of architectures
that allows varying input lengths and properly accounts for the
ordering in sequence data.

Deep Learning – 2 / 21
RNNS - INTRODUCTION
Suppose we have some text data and our task is to analyse the
sentiment in the text.
For example, given an input sentence, such as "This is good news.", the
network has to classify it as either ’positive’ or ’negative’.
We would like to train a simple neural network (such as the one below) to
perform the task.

Figure: Two equivalent visualizations of a dense net with a single hidden layer, where
the left is more abstract showing the network on a layer point-of-view.
Deep Learning – 3 / 21
RNNS - INTRODUCTION
Because sentences can be of varying lengths, we need to modify
the dense net architecture to handle such a scenario.
One approach is to draw inspiration from the way a human reads a
sentence; that is, one word at a time.
An important cognitive mechanism that makes this possible is
"short-term memory".
As we read a sentence from beginning to end, we retain some
information about the words that we have already read and use
this information to understand the meaning of the entire sentence.
Therefore, in order to feed the words in a sentence sequentially to
a neural network, we need to give it the ability to retain some
information about past inputs.

Deep Learning – 4 / 21
RNNS - INTRODUCTION
When words in a sentence are fed to the network one at a time,
the inputs are no longer independent. It is much more likely that
the word "good" is followed by "morning" rather than "plastic".
Hence, we also need to model this (long-term) dependency.
Each word must still be encoded as a fixed-length vector because
the size of the input layer will remain fixed.
Here, for the sake of the visualization, each word is represented as
a ’one-hot coded’ vector of length 5. (<eos> = ’end of sequence’)

While this is one option to represent words in a network, the standard approach is word embeddings (more on this later).

Deep Learning – 5 / 21
RNNS - INTRODUCTION
Our goal is to feed the words to the network sequentially in
discrete time-steps.
A regular dense neural network with a single hidden layer only has
two sets of weights: ’input-to-hidden’ weights W and ’hidden-to-
output’ weights U.

Deep Learning – 6 / 21
RNNS - INTRODUCTION
In order to enable the network to retain information about past inputs, we
introduce an additional set of weights V, from the hidden neurons at
time-step t to the hidden neurons at time-step t + 1.
Having this additional set of weights makes the activations of the hidden
layer depend on both the current input and the activations for the
previous input.

Figure: Input-to-hidden weights W and hidden-to-hidden weights V. The


hidden-to-output weights U are not shown for better readability.

Deep Learning – 7 / 21
RNNS - INTRODUCTION
With this additional set of hidden-to-hidden weights V, the network
is now a Recurrent Neural Network (RNN).
In a regular feed-forward network, the activations of the hidden
layer are only computed using the input-hidden weights W (and
bias b).
z = σ(W⊤ x + b)
In an RNN, the activations of the hidden layer (at time-step t) are
computed using both the input-to-hidden weights W and the
hidden-to-hidden weights V.

z[t ] = σ(V⊤ z[t−1] + W⊤ x[t ] + b)

The vector z[t ] represents the short-term memory of the RNN


because it is a function of the current input x[t ] and the activations
z[t −1] of the previous time-step.
Therefore, by recurrence, it contains a "summary" of all previous
inputs.
Deep Learning – 8 / 21
Examples

Deep Learning – 9 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 0, we feed the word "This" to the network and obtain z[0] .
z[0] = σ(W⊤ x[0] + b)

Because this is the very first input, there is no past state (or,
equivalently, the state is initialized to 0).

Deep Learning – 10 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 1, we feed the second word to the network to obtain z[1] .
z[1] = σ(V⊤ z[0] + W⊤ x[1] + b)

Deep Learning – 11 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 2, we feed the next word in the sentence.
z[2] = σ(V⊤ z[1] + W⊤ x[2] + b)

Deep Learning – 12 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 3, we feed the next word ("news") in the sentence.
z[3] = σ(V⊤ z[2] + W⊤ x[3] + b)

Deep Learning – 13 / 21
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
Once the entire input sequence has been processed, the
prediction of the network can be generated by feeding the
activations of the final time-step to the output neuron(s).
f = σ(U⊤ z[4] + c ), where c is the bias of the output neuron.
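A minimal NumPy sketch of this sequence-to-one forward pass (weights are random; sizes follow the toy example: 5-dimensional one-hot words, 3 hidden units, 1 output):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))    # input-to-hidden weights
V = rng.normal(size=(3, 3))    # hidden-to-hidden weights
U = rng.normal(size=(3, 1))    # hidden-to-output weights
b = np.zeros(3)
c = np.zeros(1)

def predict(sequence):
    """Sequence-to-one RNN forward pass: one word (one-hot vector) per step."""
    z = np.zeros(3)                              # initial state
    for x in sequence:                           # x: one-hot word vector
        z = sigmoid(V.T @ z + W.T @ x + b)       # z[t] depends on z[t-1] and x[t]
    return sigmoid(U.T @ z + c)                  # prediction from the final state

words = [np.eye(5)[i] for i in [0, 1, 2, 3, 4]]  # "This is good news. <eos>"
print(predict(words))                            # sentiment probability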

Deep Learning – 14 / 21
PARAMETER SHARING
This way, the network can process the sentence one word at a
time and the length of the network can vary based on the length of
the sequence.
It is important to note that no matter how long the input sequence
is, the matrices W and V are the same in every time-step. This is
another example of parameter sharing.
Therefore, the number of weights in the network is independent of
the length of the input sequence.

Deep Learning – 15 / 21
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.

Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.

Examples:
Sequence-to-One: Sentiment analysis, document classification.
One-to-Sequence: Image captioning.
Sequence-to-Sequence: Language modelling, machine translation,
time-series prediction.

Deep Learning – 16 / 21
Computational Graph

Deep Learning – 17 / 21
RNNS - COMPUTATIONAL GRAPH

On the left is an abstract representation of the computational


graph for the network on the right. A loss function L measures how
far each output f is from the corresponding training target y .

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

A helpful way to think of an RNN is as multiple copies of the same


network, each passing a message to a successor.
RNNs are networks with loops, allowing information to persist.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

Things might become more clear if we unfold the architecture.


We call z[t ] the state of the system at time t.
The state contains information about the whole past sequence.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

We went from

f = τ (c + U⊤ σ(b + W⊤ x)) for the dense net, to


f [t ] = τ (c + U⊤ σ(b + V⊤ z[t −1] + W⊤ x[t ] )) for the RNN.

Deep Learning – 18 / 21
RNNS - COMPUTATIONAL GRAPH

A potential computational graph for time-step t with

f [t ] = τ (c + U⊤ σ(b + V⊤ z[t −1] + W⊤ x[t ] ))

Deep Learning – 18 / 21
RECURRENT OUTPUT-HIDDEN CONNECTIONS
Recurrent connections do not need to map from hidden to hidden
neurons!

RNN with feedback connection from the output to the hidden layer. The RNN is only
allowed to send f to future time points and, hence, z [t −1] is connected to z [t ] only
indirectly, via the predictions f [t −1] .

Deep Learning – 19 / 21
SEQ-TO-ONE MAPPINGS
RNNs do not need to produce an output at each time step. Often only
one output is produced after processing the whole sequence.

Time-unfolded recurrent neural network with a single output at the end of the
sequence. Such a network can be used to summarize a sequence and produce a fixed
size representation.

Deep Learning – 20 / 21
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http: // karpathy. github. io/ 2015/ 05/ 21/ rnn-effectiveness/

Deep Learning – 21 / 21
Deep Learning

Recurrent Neural Networks -


Backpropogation

Learning goals
How does Backpropagation work
for RNNs?
Exploding and Vanishing
Gradients
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
Task: Learn character probability distribution from input text
Suppose we only had a vocabulary of four possible letters: “h”, “e”,
“l” and “o”
We want to train an RNN on the training sequence “hello”.
This training sequence is in fact a source of 4 separate training
examples:
The probability of “e” should be likely given the context of “h”
“l” should be likely in the context of “he”
“l” should also be likely given the context of “hel”
and “o” should be likely given the context of “hell”
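A minimal sketch of how these 4 training examples can be constructed from the string “hello” (the ordering of the vocabulary is an arbitrary choice):

import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

text = "hello"
# Each character becomes a one-hot input; the target is the next character.
inputs  = [np.eye(len(vocab))[char_to_idx[c]] for c in text[:-1]]   # h, e, l, l
targets = [char_to_idx[c] for c in text[1:]]                        # e, l, l, o

for x, y in zip(inputs, targets):
    print(x, "->", vocab[y])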

Deep Learning – 1 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 2 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 3 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL

Deep Learning – 4 / 13
SIMPLE EXAMPLE: CHARACTER LEVEL
LANGUAGE MODEL
The RNN has a 4-dimensional input and output. The exemplary hidden
layer consists of 3 neurons. This diagram shows the activations in the
forward pass when the RNN is fed the characters “hell” as input. The
output contains confidences the RNN assigns for the next character.

Our goal is to increase the


confidence for the correct
letters (green digits) and
decrease the confidence of
all others (we could also
use a softmax activation to
squash the digits to
probabilities ∈ [0, 1]). How
can we now train the
network?
Backpropagation
through time!
Deep Learning – 5 / 13
BACKPROPAGATION THROUGH TIME

For training the RNN, we need to compute \frac{dL}{du_{i,j}}, \frac{dL}{dv_{i,j}}, and \frac{dL}{dw_{i,j}}.
To do so, during backpropagation at time step t for an arbitrary RNN, we need to compute
\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \ldots \frac{dz^{[2]}}{dz^{[1]}}
Deep Learning – 6 / 13
LONG-TERM DEPENDENCIES
Here, z[t] = σ(V⊤ z[t−1] + W⊤ x[t] + b).
It follows that:

\frac{dz^{[t]}}{dz^{[t-1]}} = \mathrm{diag}(\sigma'(V^\top z^{[t-1]} + W^\top x^{[t]} + b))\, V^\top = D^{[t-1]} V^\top
\frac{dz^{[t-1]}}{dz^{[t-2]}} = \mathrm{diag}(\sigma'(V^\top z^{[t-2]} + W^\top x^{[t-1]} + b))\, V^\top = D^{[t-2]} V^\top
\vdots
\frac{dz^{[2]}}{dz^{[1]}} = \mathrm{diag}(\sigma'(V^\top z^{[1]} + W^\top x^{[2]} + b))\, V^\top = D^{[1]} V^\top

\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \ldots \frac{dz^{[2]}}{dz^{[1]}} = \frac{dL}{dz^{[t]}}\, D^{[t-1]} D^{[t-2]} \ldots D^{[1]} (V^\top)^{t-1}

Deep Learning – 7 / 13
LONG-TERM DEPENDENCIES
In general, for an arbitrary time-step i < t in the past, dz[t]/dz[i] will contain the term (V⊤)^{t−i} (this follows from the chain rule).
Based on the largest eigenvalue of V⊤, the presence of the term (V⊤)^{t−i} can either result in vanishing or exploding gradients.
This problem is quite severe for RNNs (as compared to feedforward networks) because the same matrix V⊤ is multiplied several times.
As the gap between t and i increases, the instability worsens.
It is thus quite challenging for RNNs to learn long-term
dependencies. The gradients either vanish (most of the time) or
explode (rarely, but with much damage to the optimization).
That happens simply because we propagate errors over very
many stages backwards.
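A small NumPy experiment illustrating this effect (the matrix is random; only its spectral radius is controlled):

import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 3))

for scale in [0.5, 1.5]:                     # largest eigenvalue below / above 1
    M = scale * V / np.max(np.abs(np.linalg.eigvals(V)))
    grad = np.eye(3)
    for _ in range(50):                      # 50 time steps back
        grad = grad @ M.T                    # repeated multiplication with V^T
    print(scale, np.linalg.norm(grad))       # ~0 (vanishing) vs. huge (exploding)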

Deep Learning – 8 / 13
LONG-TERM DEPENDENCIES

Figure: Exploding gradients

Deep Learning – 9 / 13
LONG-TERM DEPENDENCIES
Recall, that we can counteract exploding gradients by
implementing gradient clipping.
To avoid exploding gradients, we simply clip the norm of the
gradient at some threshold h (see chapter 4):

\text{if } \lVert \nabla W \rVert > h: \quad \nabla W \leftarrow \frac{h}{\lVert \nabla W \rVert}\, \nabla W
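A minimal sketch of this rule (in PyTorch, torch.nn.utils.clip_grad_norm_ implements the same idea):

import numpy as np

def clip_gradient(grad, h):
    """Rescale the gradient if its norm exceeds the threshold h."""
    norm = np.linalg.norm(grad)
    if norm > h:
        grad = (h / norm) * grad
    return grad

g = np.array([3.0, 4.0])            # norm 5
print(clip_gradient(g, h=1.0))      # [0.6 0.8], norm 1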

Deep Learning – 10 / 13
LONG-TERM DEPENDENCIES

Figure: Vanishing gradients

Deep Learning – 11 / 13
LONG-TERM DEPENDENCIES
Even for a stable RNN (gradients not exploding), there will be
exponentially smaller weights for long-term interactions compared
to short-term ones and a more sophisticated solution is needed for
this vanishing gradient problem (discussed in the next chapters).
The vanishing gradient problem heavily depends on the choice of
the activation functions.
Sigmoid maps a real number into a “small” range (i.e. [0, 1])
and thus even huge changes in the input will only produce a
small change in the output. Hence, the gradient will be small.
This becomes even worse when we stack multiple layers.
We can avoid this problem by using activation functions which
do not “squash” the input.
The most popular choice is ReLU with gradients being either
0 or 1, i.e., they never saturate and thus don’t vanish.
The downside of this is that we can obtain a “dead” ReLU.

Deep Learning – 12 / 13
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Deep Learning – 13 / 13
Deep Learning

Modern Recurrent Neural Networks

Learning goals
LSTM cell
GRU cell
Bidirectional RNNs
Long Short-Term Memory (LSTM)

Deep Learning – 1 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.

A simple RNN mechanism;

Until now, we simply computed

z[t ] = σ(b + V> z[t −1] + W> x[t ] )

Deep Learning – 2 / 15
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.

Left: A simple RNN mechanism; Right: An LSTM cell

Until now, we simply computed

z[t ] = σ(b + V> z[t −1] + W> x[t ] )

Now we introduce the LSTM cell, a small network on its own.


Deep Learning – 2 / 15
LONG SHORT-TERM MEMORY (LSTM)

The key to LSTMs is the cell state s[t ] .


s[t ] can be manipulated by different gates to forget old information,
add new information, and read information out of it.
Each gate is a vector of the same size as s[t ] with elements
between 0 ("let nothing pass") and 1 ("let everything pass").

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Forget gate e[t ] : indicates which information of the old cell state
we should forget.
Intuition: Think of a model trying to predict the next word based on
all the previous ones. The cell state might include the gender of
the present subject, so that the correct pronouns can be used.
When we now see a new subject, we want to forget the gender of
the old one.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

We obtain the forget gate by computing

e[t] = σ(b_e + V_e⊤ z[t−1] + W_e⊤ x[t])

σ() is a sigmoid and V_e, W_e are forget-gate specific weights.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

To compute the cell state s[t], the first step is to multiply (element-wise) the previous cell state s[t−1] by the forget gate e[t]:

e[t] ⊙ s[t−1], with e[t] ∈ [0, 1]

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Input gate i[t ] : indicates which new information should be added


to s[t ] .
Intuition: In our example, this is where we add the new information
about the gender of the new subject.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

The new information is given by s̃[t] = tanh(b + V⊤ z[t−1] + W⊤ x[t]) ∈ [−1, 1].
The input gate is given by i[t] = σ(b_i + V_i⊤ z[t−1] + W_i⊤ x[t]) ∈ [0, 1].
W and V are the weights of the new information, W_i and V_i the weights of the input gate.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Now we can finally compute the cell state s[t]:

s[t] = e[t] ⊙ s[t−1] + i[t] ⊙ s̃[t]

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Output gate o[t]: indicates which information from the cell state is filtered.
It is given by o[t] = σ(b_o + V_o⊤ z[t−1] + W_o⊤ x[t]), with specific weights W_o, V_o.

Deep Learning – 3 / 15
LONG SHORT-TERM MEMORY (LSTM)

Finally, the new state z[t ] of the LSTM is a function of the cell state,
multiplied by the output gate:

z[t] = o[t] ⊙ tanh(s[t])
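A minimal NumPy sketch of one LSTM step following the gate equations above (sizes and random weights are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, z_prev, s_prev, p):
    """One LSTM step: forget gate, input gate, cell state, output gate."""
    e = sigmoid(p["Ve"].T @ z_prev + p["We"].T @ x + p["be"])      # forget gate
    i = sigmoid(p["Vi"].T @ z_prev + p["Wi"].T @ x + p["bi"])      # input gate
    s_tilde = np.tanh(p["V"].T @ z_prev + p["W"].T @ x + p["b"])   # new information
    s = e * s_prev + i * s_tilde                                   # cell state
    o = sigmoid(p["Vo"].T @ z_prev + p["Wo"].T @ x + p["bo"])      # output gate
    z = o * np.tanh(s)                                             # new hidden state
    return z, s

# Illustrative sizes: 4-dimensional input, 3 hidden units.
rng = np.random.default_rng(0)
params = {n: rng.normal(size=(3, 3)) for n in ["Ve", "Vi", "V", "Vo"]}
params.update({n: rng.normal(size=(4, 3)) for n in ["We", "Wi", "W", "Wo"]})
params.update({n: np.zeros(3) for n in ["be", "bi", "b", "bo"]})

z, s = np.zeros(3), np.zeros(3)
for x in np.eye(4):                       # feed a dummy sequence of one-hot vectors
    z, s = lstm_step(x, z, s, params)
print(z)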

Deep Learning – 3 / 15
Gated Recurrent Units (GRU)

Deep Learning – 4 / 15
GATED RECURRENT UNITS (GRU)

The key distinction between regular RNNs and GRUs is that the
latter support gating of the hidden state.
Here, we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset.
These mechanisms are learned to:
avoid the vanishing/exploding gradient problem which comes
with a standard recurrent neural network.
solve the vanishing gradient problem by using an update gate
and a reset gate.
control the information that flows into (update gate) and out of
(reset gate) memory.

Deep Learning – 5 / 15
GATED RECURRENT UNITS (GRU)

Figure: Update gate in a GRU.

For a given time step t, the hidden state of the last time step is
z[t −1] . The update gate u[t ] is computed as follows:
u[t] = σ(W_u⊤ x[t] + V_u⊤ z[t−1] + b_u)

We use a sigmoid to transform input values to (0, 1).

Deep Learning – 6 / 15
GATED RECURRENT UNITS (GRU)

Figure: Reset gate in a GRU.

Similarly, the reset gate r[t] is computed as follows:

r[t] = σ(W_r⊤ x[t] + V_r⊤ z[t−1] + b_r)

Deep Learning – 7 / 15
GATED RECURRENT UNITS (GRU)

Figure: Hidden state computation in GRU. Multiplication is carried out elementwise.

z̃[t] = tanh(W_z⊤ x[t] + V_z⊤ (r[t] ⊙ z[t−1]) + b_z).
In a conventional RNN, we would have a hidden state update of the form: z[t] = tanh(W_z⊤ x[t] + V_z⊤ z[t−1] + b_z).

Deep Learning – 8 / 15
GATED RECURRENT UNITS (GRU)

Figure: Update gate in a GRU. The multiplication is carried out elementwise.

The update gate u[t] determines how much of the old state z[t−1] and the new candidate state z̃[t] is used:
z[t] = u[t] ⊙ z[t−1] + (1 − u[t]) ⊙ z̃[t].
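A minimal NumPy sketch of one GRU step following the equations above (sizes and random weights are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, z_prev, p):
    """One GRU step: update gate, reset gate, candidate state, new state."""
    u = sigmoid(p["Wu"].T @ x + p["Vu"].T @ z_prev + p["bu"])          # update gate
    r = sigmoid(p["Wr"].T @ x + p["Vr"].T @ z_prev + p["br"])          # reset gate
    z_tilde = np.tanh(p["Wz"].T @ x + p["Vz"].T @ (r * z_prev) + p["bz"])
    return u * z_prev + (1.0 - u) * z_tilde                            # new state

rng = np.random.default_rng(0)        # illustrative sizes: input dim 4, hidden dim 3
p = {**{n: rng.normal(size=(4, 3)) for n in ["Wu", "Wr", "Wz"]},
     **{n: rng.normal(size=(3, 3)) for n in ["Vu", "Vr", "Vz"]},
     **{n: np.zeros(3) for n in ["bu", "br", "bz"]}}

z = np.zeros(3)
for x in np.eye(4):                   # feed a dummy sequence of one-hot vectors
    z = gru_step(x, z, p)
print(z)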

Deep Learning – 9 / 15
GATED RECURRENT UNITS (GRU)

Figure: GRU

These designs can help us to eleminate the vanishing gradient problem


in RNNs and capture better dependencies for time series with large
time step distances. In summary, GRUs have the following two
distinguishing features:
Reset gates help capture short-term dependencies in time series.
Update gates help capture long-term dependencies in time series.

Deep Learning – 10 / 15
GRU VS LSTM

Figure: GRU vs LSTM

Deep Learning – 11 / 15
Bidirectional RNNs

Deep Learning – 12 / 15
BIDIRECTIONAL RNNS
Another generalization of the simple RNN are bidirectional RNNs.
These allow us to process sequential data depending on both past
and future inputs, e.g. an application predicting missing words,
which probably depend on both preceding and following words.
One RNN processes inputs in the forward direction from x[1] to x[T ]
computing a sequence of hidden states (z[1] , . . . , z(T ) ), another
RNN in the backward direction from x[T ] to x[1] computing hidden
states (g[T ] , . . . , g[1] )
Predictions are then based on both hidden states, which could be
concatenated.
With connections going back in time, the whole input sequence
must be known in advance to train and infer from the model.
Bidirectional RNNs are often used for the encoding of a sequence
in machine translation.

Deep Learning – 13 / 15
BIDIRECTIONAL RNNS
Computational graph of a bidirectional RNN:

Figure: A bidirectional RNN consists of a forward RNN processing inputs from


left to right and a backward RNN processing inputs backwards in time.

Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/

Deep Learning – 15 / 15
Deep Learning

Applications of RNNs

Learning goals
RNN Applications in NLP
RNN Applications in Computer
Vision
Get to know Encoder-Decoder
Architectures
RNN’s Applications

Deep Learning – 1 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.

Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.

Examples:
One-to-One : Image classification, video frame classification.

Deep Learning – 2 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Examples:
Many-to-One : Here a sequence of multiple steps as input are mapped to
a class or quantity prediction.
Example applications are: Sentiment analysis, document classification,
video classification, visual question answering.

Deep Learning – 3 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Ragrawal et al, “Visual 7W: Grounded Question Answering in Images”, CVPR
2015 Figures from Agrawal et al, copyright IEEE 2015.

Deep Learning – 4 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Examples:
One-to-Many: In this type of problem, an observation is mapped as input
to a sequence with multiple steps as an output.
Example applications are:
Image captioning: A combination of CNNs and RNNs are
used to provide a description of what exactly is happening
inside an image. CNN does the segmentation part and RNN
then uses the segmented data to recreate the description.
Video tagging: the RNNs can be used for video search where
we can do image description of a video divided into numerous
frames. Deep Learning – 5 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Show and Tell: A Neural Image Caption Generator (Oriol


Vinyals et al. 2014). A language generating RNN tries to describe in
brief the content of different images.
Deep Learning – 6 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Figure: Show and Tell: A Neural Image Caption Generator (Oriol


Vinyals et al. 2014). A language generating RNN tries to describe in
brief the content of different images.
Deep Learning – 6 / 27
RNNS - USE CASE SPECIFIC ARCHITECTURES

Image captioning is a fundamental task in Artificial Intelligence


which describes objects, attributes, and relationship in an image,
in a natural language form.
It has many applications such as semantic image search, bringing
visual intelligence to chatbots, or helping visually-impaired people
to see the world around them.

Deep Learning – 7 / 27
Seq-to-Seq (Type I)

Deep Learning – 8 / 27
RNNS - LANGUAGE MODELLING
In an earlier example, we built a ’sequence-to-one’ RNN model to
perform ’sentiment analysis’.
Another common task in Natural Language Processing (NLP) is
’language modelling’.
Input: word/character, encoded as a one-hot vector.
Output: probability distribution over words/characters given
previous words
τ
Y
[ 1]
P(y , . . . , y [τ ]
)= P(y [i ] |y [1] , . . . , y [i −1] )
i =1

→ given a sequence of previous characters, ask the RNN to model


the probability distribution of the next character in the sequence!

Deep Learning – 9 / 27
RNNS - LANGUAGE MODELLING
In this example, we will feed the characters in the word "hello" one
at a time to a ’seq-to-seq’ RNN.
For the sake of the visualization, the characters "h", "e", "l" and "o"
are one-hot coded as a vectors of length 4 and the output layer
only has 4 neurons, one for each character (we ignore the <eos>
token).
At each time step, the RNN has to output a probability distribution
(softmax) over the 4 possible characters that might follow the
current input.
Naturally, if the RNN has been trained on words in the English
language:
The probability of “e” should be likely, given the context of “h”.
“l” should be likely in the context of “he”.
“l” should also be likely, given the context of “hel”.
and, finally, “o” should be likely, given the context of “hell”.

Deep Learning – 10 / 27
RNNS - LANGUAGE MODELLING

The probability of “e” should be high, given the context of “h”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “l” should be high, given in the context of “he”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “l” should also be high, given in the context of


“hel”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

The probability of “o” should be high, given the context of “hell”.

Deep Learning – 11 / 27
RNNS - LANGUAGE MODELLING

During training, our goal would be to increase the confidence for


the correct letters (indicated by the green arrows) and decrease
the confidence of all others.
Deep Learning – 11 / 27
WORD EMBEDDINGS

Source: Kaggle

Figure: Two-dimensional embedding space. Typically, the embedding space is much


higher dimensional.
Instead of one-hot representations of words it is standard practice
to encode each word as a dense (as opposed to sparse) vector of
fixed size that captures its underlying semantic content.
Similar words are embedded close to each other in a
lower-dimensional embedding space.

Deep Learning – 12 / 27
WORD EMBEDDINGS
The dimensionality of these embeddings is typically much smaller
than the number of words in the dictionary.
Using them gives you a "warm start" for any NLP task. It is an
easy way to incorporate prior knowledge into your model and a
rudimentary form of transfer learning.
Two very popular approaches to learn word embeddings are
word2vec by Google and GloVe by Facebook. These embeddings
are typically 100 to 1000 dimensional.
Even though these embeddings capture the meaning of each word
to an extent, they do not capture the semantics of the word in a
given context because each word has a static precomputed
representation. For example, depending on the context, the word
"bank" might refer to a financial institution or to a river bank.

Deep Learning – 13 / 27
Seq-to-Seq (Type II)

Deep Learning – 14 / 27
Encoder-Decoder Architectures

Deep Learning – 15 / 27
ENCODER-DECODER NETWORK
For many interesting applications such as question answering,
dialogue systems, or machine translation, the network needs to
map an input sequence to an output sequence of different length.
This is what an encoder-decoder (also called
sequence-to-sequence architecture) enables us to do!

Deep Learning – 16 / 27
ENCODER-DECODER NETWORK

Figure: In the first part of the network, information from the input is encoded in
the context vector, here the final hidden state, which is then passed on to
every hidden state of the decoder, which produces the target sequence.

Deep Learning – 17 / 27
ENCODER-DECODER NETWORK
An input/encoder-RNN processes the input sequence of length nx
and computes a fixed-length context vector C, usually the final
hidden state or simple function of the hidden states.
One time step after the other information from the input sequence
is processed, added to the hidden state and passed forward in
time through the recurrent connections between hidden states in
the encoder.
The context vector summarizes important information from the
input sequence, e.g. the intent of a question in a question
answering task or the meaning of a text in the case of machine
translation.
The decoder RNN uses this information to predict the output, a
sequence of length ny , which could vary from nx .

Deep Learning – 18 / 27
ENCODER-DECODER NETWORK
In machine translation, the decoder is a language model with
recurrent connections between the output at one time step and the
hidden state at the next time step as well as recurrent connections
between the hidden states:
ny
Y
P(y [1] , . . . , y [yn ] |x[1] , . . . , x[xn ] ) = p(y [t ] |C ; y [1] , . . . , y [t −1] )
t =1

with C being the context-vector.


This architecture is now jointly trained to minimize the translation
error given a source sentence.
Each conditional probability is then

p(y [t ] |y [1] , . . . , y [t −1] ; C ) = f (y [t −1] , g [t ] , C )

where f is a non-linear function, e.g. the tanh and g [t ] is the hidden


state of the decoder network.
Deep Learning – 19 / 27
More application examples

Deep Learning – 20 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Neural Machine Translation (seq2seq): Sequence to Sequence


Learning with Neural Networks (Ilya Sutskever et al. 2014). As we saw
earlier, an encoder converts a source sentence into a “meaning” vector
which is passed through a decoder to produce a translation.
Deep Learning – 21 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Neural Machine Translation (seq2seq): Sequence to Sequence


Learning with Neural Networks (Ilya Sutskever et al. 2014). As we saw
earlier, an encoder converts a source sentence into a “meaning” vector
which is passed through a decoder to produce a translation.
Deep Learning – 21 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Generating Sequences With Recurrent Neural Networks (Alex Graves,


2013). Top row are real data, the rest are generated by various RNNs.

Deep Learning – 22 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Convolutional and recurrent nets for detecting emotion from audio
data (Namrata Anand & Prateek Verma, 2016). We already had this example
in the CNN chapter!

Deep Learning – 23 / 27
SOME MORE SOPHISTICATED APPLICATIONS

Figure: Visually Indicated Sounds (Andrew Owens et al. 2016). A model to


synthesize plausible impact sounds from silent videos. Click here

Deep Learning – 24 / 27
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.oydhub.com/attention-mechanism/

Deep Learning – 25 / 27
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-eectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271

Deep Learning – 26 / 27
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Deep Learning – 27 / 27
Deep Learning

Attention and Transformers

Learning goals
Familiarize with the most recent
sequence data modeling
technique:
Attention Mechanism
Transformers
Get to know the CNN alternative
to RNNs
Attention

Deep Learning – 1 / 22
WAHT IS ATTENTION
Humans process data by actively shifting their focus:
Different parts of an image carry different information
Words derive their specific meaning from contex
Remember specific, related events in the past
Allows to follow one thought at a time while suppressing
information irrelevant to the task
Example: cocktail party problem

Deep Learning – 2 / 22
WAHT IS ATTENTION

Figure: The encoder-decoder model for translation.


(source:https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

In a classical decoder-encoder RNN all information about the input


sequence must be incorporated into the final hidden state, which is
then passed as an input to the decoder network.
With a long input sequence this fixed-sized context vector is
unlikely to capture all relevant information about the past.
Each hidden state contains mostly information from recent inputs.

Deep Learning – 3 / 22
WAHT IS ATTENTION

Figure: The encoder-decoder model for translation.


(source:https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

Different parts of input related to different parts of output.


Encoding complete content difficult (even for LSTMs).
Issue: context vector hT provides no access to earlier inputs!

Deep Learning – 4 / 22
WAHT IS ATTENTION
Key idea: Allow the decoder to access all the hidden states of the
encoder (instead of just the final one) so that it can dynamically
decide which ones are relevant at each time-step in the decoding.
This means the decoder can choose to "focus" on different hidden
states (of the encoder) at different time-steps of the decoding
process similar to how the human eye can focus on different
regions of the visual field.
This is known as an attention mechanism.

Deep Learning – 5 / 22
WAHT IS ATTENTION
The attention mechanism is implemented by an additional
component in the decoder.
For example, this can be a simple single-hidden layer feed-forward
neural network which is trained along with the RNN.
At any given time-step i of the decoding process, the network
computes the relevance of encoder state z[j ] as:

rel (z[j ] )[i ] = v>


a tanh(Wa [g
[i −1] [j ]
; z ])
where va and Wa are the parameters of the feed-forward network,
g[i −1] is the decoder state from the previous time-step and ’;’
indicates concatenation.
The relevance scores (for all the encoder hidden states) are then
normalized which gives the attention weights (α[j ] )[i ] :

exp(rel (z[j ] )[i ] )


(α[j ] )[i ] = P [j 0 ] [i ]
j 0 exp(rel (z ) )

Deep Learning – 6 / 22
WAHT IS ATTENTION
The attention mechanism allows the decoder network to focus on
different parts of the input sequence by adding connections from
all hidden states of the encoder to each hidden state of the
decoder.

Figure: Attention at i = t + 1

Deep Learning – 7 / 22
WAHT IS ATTENTION
At each time step i, a set of weights (α[j ] )[i ] is computed which
determine how to combine the hidden states of the encoder into a
Pn
context vector g[i ] = j =x 1 (α[j ] )[i ] z[j ] , which holds the necessary
information to predict the correct output.
Each hidden state contains mostly information from recent inputs.
In the case of a bidirectional RNN to encode the input sequence, a
hidden state contains information from recent preceding and
following inputs.

Deep Learning – 8 / 22
WAHT IS ATTENTION

Figure: Attention at i = t + 2

Deep Learning – 9 / 22
WAHT IS ATTENTION

Credit: Gabriel Loye

Figure: An illustration of a machine translation task using an encoder-decoder model


with an attention mechanism. The attention weights at each time-step of the
decoding/translation process indicate which parts of the input sequence are most
relevant. (There are 4 attention weights because there are 4 encoder states.)

Deep Learning – 10 / 22
ATTENTION

Figure: Attention for image captioning: the attention mechanism tells the
network roughly which pixels to pay attention to when writing the text (Kelvin
Xu al. 2015)

Deep Learning – 11 / 22
Transformers

Deep Learning – 12 / 22
TRANSFORMERS
Advanced RNNs have similar limitations as vanilla RNN networks:
RNNs process the input data sequentially.
Difficulties in learning long term dependency (although GRU
or LSTM perform better than vanilla RNNs, they sometimes
struggle to remember the context introduced earlier in long
sequences).
These challenges are tackled by transformer networks.

Deep Learning – 13 / 22
TRANSFORMERS
Transformers are solely based on attention (no RNN or CNN).
In fact, the paper which coined the term transformer is called
Attention is all you need.
They are the state-of-the-art networks in natural language
processing (NLP) tasks since 2017.
Transformer architectures like BERT (Bidirectional Encoder
Representations from Transformers, 2018) and GPT-3 (Generative
Pre-trained Transformer-3, 2020) are pre-trained on a large corpus
and can be fine-tuned to specific language tasks.

Deep Learning – 14 / 22
TRANSFORMERS

Deep Learning – 15 / 22
CNNs or RNNs?

Deep Learning – 16 / 22
CNNS OR RNNS?
Historically, RNNs were the default for sequence processing tasks.
However, some families of CNNs (especially those based on Fully
Convolutional Networks (FCNs)) can be used to process
variable-length sequences such as text or time-series data.
If a CNN doesn’t contain any fully-connected layers, the total
number of weights in the network is independent of the spatial
dimensions of the input because of weight-sharing in the
convolutional layers.
Recent research [Bai et al. , 2018] indicates that such
convolutional architectures, so-called Temporal Convolutional
Networks (TCNs), can outperform RNNs on a wide range of tasks.
A major advantage of TCNs is that the entire input sequence can
be fed to the network at once (as opposed to sequentially).

Deep Learning – 17 / 22
CNNS OR RNNS?

Figure: A TCN (we have already seen this in the CNN lecture!) is simply a variant of
the one-dimensional FCN which uses a special type of dilated convolutions called
causal dilated convolutions.

Deep Learning – 18 / 22
SUMMARY
RNNs are specifically designed to process sequences of varying
lengths.
For that recurrent connections are introduced into the network
structure.
The gradient is calculated by backpropagation through time.
An LSTM replaces the simple hidden neuron by a complex system
consisting of cell state, and forget, input, and output gates.
An RNN can be used as a language model, which can be
improved by word-embeddings.
Different advanced types of RNNs exist, like Encoder-Decoder
architectures and bidirectional RNNs.1

1. A bidirectional RNN processes the input sequence in both directions (front-to-back and back-to-front).

Deep Learning – 19 / 22
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http://www.deeplearningbook.org/
Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan (2014)
Show and Tell: A Neural Image Caption Generator
https://arxiv.org/abs/1411.4555
Alex Graves (2013)
Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
Namrata Anand and Prateek Verma (2016)
Convolutional and recurrent nets for detecting emotion from audio data
http://cs231n.stanford.edu/reports/2015/pdfs/Cs_231n_paper.pdf
Gabriel Loye (2019)
Attention Mechanism
https://blog.oydhub.com/attention-mechanism/

Deep Learning – 20 / 22
REFERENCES
Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H.
Adelson and William T. Freeman (2015)
Visually Indicated Sounds
https://arxiv.org/abs/1512.08512
Andrej Karpathy (2015)
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-eectiveness/
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel and Yoshua Bengio (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://arxiv.org/abs/1502.03044
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (2018)
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for
Sequence Modeling
https://arxiv.org/abs/1803.01271

Deep Learning – 21 / 22
REFERENCES
Lilian Weng (2018)
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Deep Learning – 22 / 22
Deep Learning

Unsupervised Learning

Learning goals
Unsupervised learning tasks
Unsupervised deep learning
UNSUPERVISED LEARNING

So far, we have described the application of neural networks to


supervised learning in which we have labeled training data
(x (1) , y (1) ), . . . , (x (n) , y (n) ).
In supervised learning scenarios we exploit label information (i.e.
class memberships or numeric values) to train our algorithm.
The model learns a function to map x to y .
Examples are: classification, regression, object detection,
semantic segmentation, image captioning, etc.

Deep Learning – 1 / 8
UNSUPERVISED LEARNING
In unsupervised learning scenarios training data consists of
unlabeled input points x (1) , . . . , x (n) .
Our goal is to learn some underlying hidden structure of the data.
Examples are: clustering, dimensionality reduction, feature
learning, density estimation, etc.

Deep Learning – 2 / 8
UNSUPERVISED LEARNING - EXAMPLES

1. Clustering.

Figure: Cluster analysis results for different algorithms. Different clusters are
indicated by different colors. (Source : Wikipedia)

Deep Learning – 3 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for visualisation in a low dimensional space.

Figure: Principal Component Analysis (PCA)

Deep Learning – 4 / 8
UNSUPERVISED LEARNING - EXAMPLES
2. Dimensionality reduction/manifold learning.
E.g. for image compression.

Figure: from https://de.slideshare.net/hcycon/bildkompression

Deep Learning – 5 / 8
UNSUPERVISED LEARNING - EXAMPLES

3. Feature extraction/representation learning.

Figure: Source: Wikipedia

E.g. for semi-supervised learning: features learned from an


unlabeled dataset are employed to improve performance in a
supervised setting.

Deep Learning – 6 / 8
UNSUPERVISED LEARNING - EXAMPLES
4. Density fitting/learning a generative model.

Figure: A generative model can reconstruct the missing portions of the


images. (Bornschein, Shabanian, Fischer & Bengio, ICML, 2016)

Deep Learning – 7 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,

Deep Learning – 8 / 8
UNSUPERVISED DEEP LEARNING
Given i.i.d. (unlabeled) data x1 , x2 , . . . , xn ∼ pdata , in unsupervised
deep learning, one usually trains :
an autoencoder (a special kind of neural network) for
representation learning (feature extraction, dimensionality
reduction, manifold learning, ...), or,
a generative model, i.e. a probabilistic model of the data
generating distribution pdata (data generation, outlier detection,
missing feature extraction, reconstruction, denoising or planning in
reinforcement learning, ...).

Deep Learning – 8 / 8
Deep Learning

Autoencoders - Basic Principle

Learning goals
Task and structure of an AE
Undercomplete AEs
Relation of AEs and PCA
AUTOENCODER-TASK AND STRUCTURE

Autoencoders (AEs) are NNs for unsupervised learning of a lower


dimensional feature representation from unlabeled training data.
Task: Learn a compression of the data.
Autoencoders consist of two parts:
encoder learns mapping from the data x to a
low-dimensional latent variable z = enc (x).
decoder learns mapping back from latent z to a
reconstruction x̂ = dec (z) of x.
Loss function does not use any labels and measures the quality of
the reconstruction compared to the input:

L (x, dec (enc (x)))

Goal: Learn good representation z (also called code).

Deep Learning – 1 / 15
AUTOENCODER (AE)- COMPUTATIONAL GRAPH
The general structure of an AE as a computational graph:

An AE has two computational steps:


the encoder enc, mapping x to z.
the decoder dec, mapping z to x̂.

Deep Learning – 2 / 15
Undercomplete Autoencoders

Deep Learning – 3 / 15
UNDERCOMPLETE AUTOENCODERS
A naive implementation of an autoencoder would simply learn the
identity dec (enc (x)) = ^
x.
This would not be useful.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS

A naive implementation of an autoencoder would simply learn the


identity dec (enc (x)) = ^
x.
This would not be useful.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
Therefore we have a “bottleneck” layer: We restrict the
architecture, such that
dim(z) < dim(x)
Such an AE is called undercomplete.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS
In an undercomplete AE, the hidden layer has fewer neurons than
the input layer.
→ That will force the AE to
capture only the most salient features of the training data!
learn a “compressed” representation of the input.

Deep Learning – 4 / 15
UNDERCOMPLETE AUTOENCODERS

Training an AE is done by minimizing the risk with a loss function


penalizing the reconstruction dec (enc (x)) for differing from x.
The L2-loss
kx − dec (enc (x))k22
is a typical choice, but other loss functions are possible.
For optimization, the same optimization techniques as for standard
feed-forward nets are applied (SGD, RMSProp, ADAM,...).

Deep Learning – 5 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Let us try to compress the MNIST data as good as possible.


We train undercomplete AEs with different dimensions of the
internal representation z (.i.e. different “bottleneck” sizes).

Figure: Flow chart of our our autoencoder: reconstruct the input with fixed
dimensions dim(z) ≤ dim(x).

Deep Learning – 6 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: Architecture of the autoencoder.

Deep Learning – 7 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 784 = dim(x).

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 256.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 64.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 32.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 16.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 8.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 4.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 2.

Deep Learning – 8 / 15
EXPERIMENT: LEARN TO ENCODE MNIST

Figure: The top row shows the original digits, the bottom row the
reconstructed ones.
dim(z) = 1.

Deep Learning – 8 / 15
INCREASING THE CAPACTIY OF AES
Increasing the number of layers adds capacity to autoencoders:

Deep Learning – 9 / 15
Autoencoders as Principal Component
Analysis

Deep Learning – 10 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS

Consider a undercomplete autoencoder with


linear encoder function enc (x), and
linear decoder function dec (z ).
The L2-loss kx − dec (enc (x))k22 is employed and inputs are
normalized to zero mean.
We want to find the linear projection of the data with the minimal
L2-reconstruction error.

Deep Learning – 11 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
It can be shown that the optimal solution is an orthogonal linear
transformation (i.e. a rotation of the coordinate system) given by
the dim(z ) = k singular vectors with largest singular values.

Deep Learning – 12 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
This is an equivalent formulation to Principal Component
Analysis (PCA), which uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal
components.
The transformation is defined in such a way that the first principal
component has the largest possible variance (i.e., accounts for as
much of the variability in the data as possible).

Deep Learning – 13 / 15
AES AS PRINCIPAL COMPONENT ANALYSIS
The formulations are equivalent: “Find a linear projection into a
k -dimensional space that ...”
“... minimizes the L2-reconstruction error” (AE-based
formulation).
“... maximizes the variance of the projected datapoints”
(statistical formulation).
An AE with a non-linear decoder/encoder can be seen as a
non-linear generalization of PCA.

Figure: Credits: Jeremy Jordan “Introduction to autoencoders”

Deep Learning – 14 / 15
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/

Deep Learning – 15 / 15
Deep Learning

Regularized Autoencoders

Learning goals
Overcomplete AEs
Sparse AEs
Denoising AEs
Contractive AEs
Overcomplete Autoencoders

Deep Learning – 1 / 20
OVERCOMPLETE AE – PROBLEM
Overcomplete AE (code dimension ≥ input dimension): even a linear
AE can copy the input to the output without learning anything useful.
How can an overcomplete AE be useful?

Figure: Overcomplete AE that learned to copy its inputs to the hidden layer
and then to the output layer (Credits to M. Ponti).

Deep Learning – 2 / 20
REGULARIZED AUTOENCODER

Goal: choose code dimension and capacity of encoder/decoder


based on the problem.
Regularized AEs modify the original loss function to:
prevent the network from trivially copying the inputs.
encourage additional properties.
Examples:
Sparse AE: sparsity of the representation.
Denoising AE: robustness to noise.
Contractive AE: small derivatives of the representation
w.r.t. input.
⇒ A regularized AE can be overcomplete and nonlinear but still learn
something useful about the data distribution!

Deep Learning – 3 / 20
Sparse Autoencoder

Deep Learning – 4 / 20
SPARSE AUTOENCODER

Idea: Regularization with a sparsity constraint

L(x, dec (enc (x))) + λkz k1

Try to keep the number of active neurons per training input low.
Forces the model to respond to unique statistical features of the
input data.

Figure: Sparse Autoencoder (Credits to M. Ponti).

Deep Learning – 5 / 20
Denoising Autoencoders

Deep Learning – 6 / 20
DENOISING AUTOENCODERS (DAE)
The denoising autoencoder (DAE) is an autoencoder that receives a
corrupted data point as input and is trained to predict the original,
uncorrupted data point as its output.
Idea: representation should be robust to introduction of noise.
Produce corrupted version x̃ of input x, e.g. by
random assignment of subset of inputs to 0.
adding Gaussian noise.
Modified reconstruction loss: L(x, dec (enc (x̃)))
→ denoising AEs must learn to undo this corruption.

Deep Learning – 7 / 20
DENOISING AUTOENCODERS (DAE)
With the corruption process, we induce stochasticity into the DAE.
Formally: let C (x̃|x) present the conditional distribution of
corrupted samples x̃, given a data sample x.
Like feedforward NNs can model a distribution over targets p(y|x),
output units and loss function of an AE can be chosen such that
one gets a stochastic decoder pdecoder (x|z).
E.g. linear output units to parametrize the mean of Gaussian
distribution for real valued x and negative log-likelihood loss (which
is equal to MSE).
The DAE then learns a reconstruction distribution preconstruct (x|x̃)
from training pairs (x, x̃).
(Note that the encoder could also be made stochastic, modelling
pencoder (z|x̃).)

Deep Learning – 8 / 20
DENOISING AUTOENCODERS (DAE)
The general structure of a DAE as a computational graph:

Figure: Denoising autoencoder: “making the learned representation robust to


partial corruption of the input pattern.”

Deep Learning – 9 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

A DAE is trained to map a corrupted data point x̃ back to the original


data point x.

Deep Learning – 10 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

The corruption process C (x̃|x) is displayed by the gray circle of


equiprobable corruptions
Training a DAE by minimizing ||dec (enc (x̃)) − x||2 corresponds to
minimizing Ex,x̃∼pdata (x)C (x̃|x) [− log pdecoder (x|f (x̃))].

Deep Learning – 11 / 20
DENOISING AUTOENCODERS (DAE)

Figure: Denoising autoencoders - “manifold perspective” (Ian Goodfellow et


al. (2016))

The vector dec (enc (x̃)) − x̃ points approximately towards the


nearest point in the data manifold, since dec (enc (x̃)) estimates
the center of mass of clean points x which could have given rise to
x̃.
Thus, the DAE learns a vector field dec (enc (x̃)) − x indicated by
the green arrows.

Deep Learning – 12 / 20
DENOISING AUTOENCODERS (DAE)
An example of a vector field learned by a DAE.

Figure: source: Ian Goodfellow et al. (2016)

Deep Learning – 13 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

We will now corrupt the MNIST data with Gaussian noise and then
try to denoise it as good as possible.

Figure: Flow chart of our autoencoder: denoise the corrupted input.

Deep Learning – 14 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

To corrupt the input, we randomly add or subtract values from a


uniform distribution to each of the image entries.

Figure: Top row: original data, bottom row: corrupted mnist data.

Deep Learning – 15 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 1568 (overcomplete).

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 784 (= dim(x)).

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 256.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 64.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 32.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 16.

Deep Learning – 16 / 20
EXPERIMENT: ENCODE MNIST WITH A DAE

Figure: The top row shows the original digits, the intermediate one the
corrupted and the bottom row the denoised/reconstructed digits (prediction).
dim(z ) = 8.

Deep Learning – 16 / 20
Contractive Autoencoder

Deep Learning – 17 / 20
CONTRACTIVE AUTOENCODER

Goal: For very similar inputs, the learned encoding should also be
very similar.
We can train our model in order for this to be the case by requiring
that the derivative of the hidden layer activations are small with
respect to the input.
In other words: The encoded state enc (x) should not change
much for small changes in the input.
Add explicit regularization term to the reconstruction loss:
∂ enc (x) 2
L(x, dec (enc (x)) + λk ∂ x kF

Deep Learning – 18 / 20
DAE VS. CAE

DAE CAE
the decoder function is trained the encoder function is trained
to resist infinitesimal perturba- to resist infinitesimal perturba-
tions of the input. tions of the input.

Both the denoising and contractive autoencoders perform well.


Advantage of denoising autoencoder: simpler to implement
requires adding one or two lines of code to regular AE.
no need to compute Jacobian of hidden layer.
Advantage of contractive autoencoder: gradient is deterministic
can use second order optimizers (conjugate gradient, LBFGS,
etc.).
might be more stable than the denoising autoencoder, which
uses a sampled gradient.

Deep Learning – 19 / 20
REFERENCES
Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016)
Deep Learning
http: // www. deeplearningbook. org/
Everything you wanted to know about Deep Learning for Computer Vision but
were afraid to ask (2017)
SIBGRAPI Tutorials 2017

Deep Learning – 20 / 20
Deep Learning

Specific Autoencoders and Applications

Learning goals
convolutional AEs
applications of AEs
CONVOLUTIONAL AUTOENCODER (CONVAE)

For the image domain, using convolutions is advantageous. Can


we also make use of them in AEs?
In a ConvAE, the encoder consists of convolutional layers. The
decoder, on the other hand, consists of transpose convolution
layers or simple upsampling operations.

Deep Learning – 1 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)

Figure: Potential architecture of a convolutional autoencoder.

We now apply this architecture to denoise MNIST.

Deep Learning – 2 / 6
CONVOLUTIONAL AUTOENCODER (CONVAE)

Figure: Top row: noised data, second row: AE with dim(z ) = 32 (roughly 50k
params), third row: ConvAE (roughly 25k params), fourth row: ground truth.

Deep Learning – 3 / 6
REAL-WORLD APPLICATIONS
Today, autoencoders are still used for tasks such as:
data de-noising,
compression,
and dimensionality reduction for the purpose of visualization.

Deep Learning – 4 / 6
REAL-WORLD APPLICATIONS
Medical image denoising using convolutional denoising autoencoders

Figure: Top row : real image, second row : noisy version, third row : results of
a (convolutional) denoising autoencoder and fourth row : results of a median
filter (Lovedeep Gondara (2016))

Deep Learning – 5 / 6
REAL-WORLD APPLICATIONS
AE-based image compression.

Figure: from Theis et al.

Deep Learning – 6 / 6
Deep Learning

Manifold learning

Learning goals
manifold hypothesis
manifold learning with AEs
Manifold hypothesis: Data of interest lies on an embedded
non-linear manifold within the higher-dimensional space.
A manifold:
is a topological space that locally resembles the Euclidean
space.
in ML, more loosely refers to a connected set of points that
can be approximated well by considering only a small number
of dimensions.

Figure: from Goodfellow et. al

Deep Learning – 1 / 6
An important characterization of a manifold is the set of its tangent
planes.
Definition: At a point x on a d-dimensional manifold, the tangent
plane is given by d basis vectors that span the local directions of
variation allowed on the manifold.

Figure: A pictorial representation of the tangent space of a single point, x, on


a manifold (Goodfellow et al. (2016)).

Deep Learning – 2 / 6
Manifold hypothesis does not need to hold true.
In the context of AI tasks (e.g. processing images, sound, or text) it
seems to be at least approximately correct, since :
probability distributions over images, text strings, and sounds
that occur in real life are highly concentrated (randomly
sampled pixel values do not look like images, randomly
sampling letters is unlikely to result in a meaningful sentence).
samples are connected to each other by other samples, with
each sample surrounded by other highly similar samples that
can be reached by applying transformations (E.g. for images:
Dim or brighten the lights, move or rotate objects, change the
colors of objects, etc).

Deep Learning – 3 / 6
LEARNING MANIFOLDS WITH AES

AEs training procedures involve a compromise between two


forces:
1 Learning a representation z of a training example x such that
x can be approximately recovered from z through a decoder.
2 Satisfying an architectural constraint or regularization penalty.
Together, they force the hidden units to capture information about
the structure of the data generating distribution
important principle: AEs can afford to represent only the variations
that are needed to reconstruct training examples.
If the data-generating distribution concentrates near a
low-dimensional manifold, this yields representations that implicitly
capture a local coordinate system for the manifold.

Deep Learning – 4 / 6
LEARNING MANIFOLDS WITH AES
Only the variations tangent to the manifold around x need to
correspond to changes in z = enc (x). Hence the encoder learns a
mapping from the input space to a representation space that is
only sensitive to changes along the manifold directions, but that is
insensitive to changes orthogonal to the manifold.

Figure: from Goodfellow et al. (2016)

Deep Learning – 5 / 6
LEARNING MANIFOLDS WITH AES
Common setting: a representation (embedding) for the points on
the manifold is learned.
Two different approaches
1 Non-parametric methods: learn an embedding for each
training example.
2 Learning a more general mapping for any point in the input
space.
AI problems can have very complicated structures that can be
difficult to capture from only local interpolation.
⇒ Motivates use of distributed representations and deep
learning!

Deep Learning – 6 / 6
Deep Learning

Introduction to Generative Models

Learning goals
learning a generative model
examples of generative models
WHICH FACE IS FAKE?

Deep Learning – 1 / 10
DEEP UNSUPERVISED LEARNING
There are two main goals of deep unsupervised learning:
Representation Learning
Examples are: manifold learning, feature learning, etc.
Can be done by an autoencoder
Examples of applications:
dimensionality reduction / data compression
transfer learning / semi-supervised learning
Generative Models
Given a training set D = (x(1) , . . . , x(n) ) where each
x(i ) ∼ Px , the goal is to estimate Px .
Goal: Take as input training samples from some distribution
and learn a model that represents that distribution!
Examples of applications:
generating music, videos, volumetric models for 3D
printing, synthetic data for learning algorithms, outlier
identification, images denoising, inpainting, etc.
Deep Learning – 2 / 10
DENSITY FITTING / LEARNING A GENERATIVE
MODEL
Given D = x(1) , x(2) , . . . , x(n) ∼ Px learn a model of Px (for


example, fitting a Gaussian distribution via Maximum Likelihood


Estimation).

Deep Learning – 3 / 10
DENSITY FITTING / LEARNING A GENERATIVE
MODEL
Given D = x(1) , x(2) , . . . , x(n) ∼ Px learn a model of Px (for


example, fitting a Gaussian distribution via Maximum Likelihood


Estimation).

Deep Learning – 3 / 10
WHY GENERATIVE MODELS?
Generative model are capable of uncovering underlying latent variables
in a dataset and can be used for
sampling / data generation
outlier detection
missing feature extraction
image denoising / reconstruction
representation learning
planning in reinforcement learning
...

Deep Learning – 4 / 10
APPLICATION EXAMPLE: IMAGE GENERATION

Source: Karras et al. (2018)

Figure: Synthetic faces generated by a Generative Adversarial Network (more


on this later).

Deep Learning – 5 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: NEURAL STYLE
TRANSFER
A photograph is “redrawn” in the style of another image! (Gatys et al.,
2015)

Figure: Examples generated on https://deepart.io/. The image on the


left has been generated by translating the original image (middle) to the style
of the image on the right.

Deep Learning – 6 / 10
APPLICATION EXAMPLE: IMAGE INPAINTING

Source: Demir et al (2018)

Figure: A generative model fills in the missing portion of the image based on
the surrounding context.

Deep Learning – 7 / 10
APPLICATION EXAMPLE: SEMANTIC LABELS –>
IMAGES

Source: Wang et al (2017)

Deep Learning – 8 / 10
APPLICATION EXAMPLE: GENERATING IMAGES
FROM TEXT

Source: Zhang et al (2017)

Deep Learning – 9 / 10
REFERENCES
Ugur Demir, Gozde Unal (2018)
Patch-Based Image Inpainting with Generative Adversarial Networks
https: // arxiv. org/ abs/ 1803. 07422
Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen (2018)
Progressive Growing of GANs for Improved Quality, Stability, and Variation
https: // arxiv. org/ abs/ 1710. 10196
Leon A. Gatys et al. (2015)
Neural Algorithm of Artistic Style
https: // arxiv. org/ abs/ 1508. 06576

Deep Learning – 10 / 10
Deep Learning

Probabilistic graphical models

Learning goals
probabilistic graphical models
latent variables
directed graphical models
Probabilistic graphical models

Deep Learning – 1 / 10
GRAPHICAL MODELS

Probabilistic graphical models describe probability


distributions by mapping conditional dependence and
independence properties between random variables on a graph
structure.

Deep Learning – 2 / 10
WHY AGAIN GRAPHICAL MODELS?
1 Graphical models visualize the
structure of a probabilistic model; they
help to develop, understand and
motivate probabilistic models.

Deep Learning – 3 / 10
WHY AGAIN GRAPHICAL MODELS?
1 Graphical models visualize the
structure of a probabilistic model; they
help to develop, understand and
motivate probabilistic models.

2 Complex computations (e.g.,


marginalization) can derived efficiently
using algorithms exploiting the graph
structure.

Deep Learning – 3 / 10
GRAPHICAL MODELS: EXAMPLE

Credit: Daphne Koller

Figure: A graphical model representing five variables and their (in-)dependencies


along with the corresponding marginal and conditional distributions. The variable
’Grade’, for example, is affected by ’Difficulty’ (of the exam) and ’Intelligence’ (of the
student). This is captured in the corresponding conditional distribution. ’Letter’ refers to
a letter of recommendation. In this model, ’Letter’ is conditionally independent of
’Difficulty’ and ’Intelligence’, given ’Grade’.
Deep Learning – 4 / 10
Latent Variables

Deep Learning – 5 / 10
LATENT VARIABLES: MOTIVATION

Figure: A simple illustration of the relevance of latent variables. Here, six 200 x 200
pixel images are shown where each pixel is either black or white. Naively, the
probability distribution over the space of all such images would need 240000 − 1
parameters to fully specify. However, we see that the images have three main factors of
variation : object type (shape), position and size. This suggests that the actual number
of parameters required might be significantly fewer.

Deep Learning – 6 / 10
LATENT VARIABLES

Additional nodes, which do not directly correspond to


observations, allow to describe complex distributions over the
visible variables by means of simple conditional distributions.
The corresponding random variables are called hidden or latent
variables.

Figure: ’Object’, ’position’ and ’size’ are the latent variables behind an image.

Deep Learning – 7 / 10
Directed generative models

Deep Learning – 8 / 10
DIRECTED GENERATIVE MODELS
Goal: Learn to generate x from some latent variables z
Z Z
pθ (x) = pθ (x, z)dz = pθ (x|z)pθ (z)dz

Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape
Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007

Figure: Left: An illustration of a directed generative model. Right: A mapping


(represented by g) from the 2D latent space to the 3D space of observed variables.

Deep Learning – 9 / 10
DIRECTED GENERATIVE MODELS
The latent variables z must be learned from the data (which only
contains the observed variables x).
pθ (x|z)pθ (z)
The posterior is given by pθ (z|x) = pθ (x)
.
R
But pθ (x) = pθ (x|z)pθ (z)dz is intractable and common
algorithms (such as Expectation Maximization) cannot be used.

Deep Learning – 10 / 10
DIRECTED GENERATIVE MODELS
The latent variables z must be learned from the data (which only
contains the observed variables x).
pθ (x|z)pθ (z)
The posterior is given by pθ (z|x) = pθ (x)
.
R
But pθ (x) = pθ (x|z)pθ (z)dz is intractable and common
algorithms (such as Expectation Maximization) cannot be used.
The classic DAG problem: How do we efficiently learn pθ (z|x)?

Popular approaches to this problem:


Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs)
Deep Learning – 10 / 10
Deep Learning

Introduction to Generative Adversarial


Networks (GANs)

Learning goals
architecture of a GAN
minimax loss
training a GANN
WHAT IS A GAN?

A generative adversarial network (GAN) consists of two DNNs:


generator
discriminator
Generator transforms random noise vector into fake sample.
Discriminator gets real and fake samples as input and outputs
probability of the input being real.

Deep Learning – 1 / 39
WHAT IS A GAN?

Goal of generator: fool discriminator into thinking that the


synthesized samples are real.
Goal of discriminator: recognize real samples and not being fooled
by generator.
This sets off an arms race. As the generator gets better at
producing realistic samples, the discriminator is forced to get better
at detecting the fake samples which in turn forces the generator to
get even better at producing realistic samples and so on.

Deep Learning – 2 / 39
FAKE CURRENCY ILLUSTRATION
The generative model can be thought of as analogous to a team of
counterfeiters, trying to produce fake currency and use it without
detection, while the discriminative model is analogous to the police,
trying to detect the counterfeit currency. Competition in this game
drives both teams to improve their methods until the counterfeits are
indistinguishable from the genuine articles.
-Ian Goodfellow

Image created by Mayank Vadsola

Deep Learning – 3 / 39
GAN Training

Deep Learning – 4 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

pdata(x) is our target, the data distribution.

The generator is a neural network mapping a latend random vector


z to generated sample G(z). Even if the generator is a
determinisic function, we have random outputs, i.e. variability.

p(z) is usually a uniform distribution or an isotropic Gaussian. It is


typically fixed and not adapted during training.

Deep Learning – 5 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

G(z) is the output of the generator for a given state z of the latent
variables.

D (x) is the output of the discriminator for a real sample x.

D (G(z)) is the output of the discriminator for a fake sample G(z)


synthesized by the generator.

Deep Learning – 6 / 39
MINIMAX LOSS FOR GANS

min max V (D , G) = Ex∼pdata(x) [log D (x)] + Ez∼p(z) [log(1 −


G D
D (G(z)))]

Ex∼pdata(x) [log D (x)] is the log-probability of correctly classifying


real data points as real.

Ez∼p(z) [log(1 − D (G(z)))] is the log-probability of correctly


classifying fake samples as fake.

With each gradient update, the discriminator tries to push D (x)


toward 1 and D (G(z))) toward 0. This is the same as maximizing
V(D,G).

The generator only has control over D (G(z)) and tries to push that
toward 1 with each gradient update. This is the same as
minimizing V(D,G).

Deep Learning – 7 / 39
GAN TRAINING : PSEUDOCODE

Algorithm 1 Minibatch stochastic gradient descent training of GANs.


Amount of training iterations, amount of discriminator updates k
1: for number of training iterations do
2: for k steps do
3: Sample minibatch of m samples {z(1) . . . z(m) } from prior pg (z)
4: Sample minibatch of m examples {x(1) . . . x(m) } from training data
5: Update discriminator by ascending the stochastic gradient:
m 
log D (x(i ) ) + log(1 − D (G(z(i ) )))

∇θd m1
P
i =1
6: end for
7: Sample minibatch of m noise samples {z(1) . . . z(m) } from the noise prior pg (z)
8: Update generator by descending the stochastic gradient:
m
∇θg m1 log(1 − D (G(z(i ) )))
P
i =1
9: end for

Deep Learning – 8 / 39
GAN TRAINING: ILLUSTRATION

GANs are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between
samples from the data generating distribution (black,dotted line) px from those of the generative distribution pg (G) (green, solid
line).Source: Goodfellow et al (2017),

For k steps, G’s parameters are frozen and one performs gradient
ascent on D to increase its accuracy.
Finally, D’s parameters are frozen and one performs gradient
descent on G to increase its generation performance.
Note, that G gets to peek at D’s internals (from the
back-propagated errors) but D does not get to peek at G.

Deep Learning – 9 / 39
DIVERGENCE MEASURES

The goal of generative modeling is to learn pdata (x).

The differences between different generative models can be


measured in terms of divergence measures.

A divergence measure quantifies the distance between two


distributions.

There are many different divergence measures that one can us


(e.g. Kullback-Leibler divergence).

All such measures always positive and 0 if and only if the two
distributions are equal to each other.

Deep Learning – 10 / 39
DIVERGENCE MEASURES

One approach to training generative models is to explicitly minimize the


distance between pdata (x) and the model distribution pθ (x) according to
some divergence measure.

If our generator has the capacity to model pdata (x) perfectly, the choice of
divergence does not matter much because they all achieve their
minimum (that is 0) when pg (x) = pdata (x).

However, it is not likely that that the generator, which is parametrized by


the weights of a neural network, is capable of perfectly modelling an
arbitrary pdata (x).

In such a scenario, the choice of divergence measure matters, because


the parameters that miniminize the various divergence measures differ.

Deep Learning – 11 / 39
IMPLICIT DIVERGENCE MEASURE OF GANS

GANs do not explicitly minimize any divergence measure.


However, (under some assumptions!) optimizing the minimax loss
is equivalent to implicitly minimizing a divergence measure.
That is, if the optimal discriminator is found in every iteration, the
generator minimizes the Jensen-Shannon divergence (JSD)
(theorem and proof are given by the original GAN paper
(Goodfellow et al, 2014)):

1 pdata + pg 1 pdata + pg
JS (pdata ||pg ) = KL(pdata || ) + KL(pg || )
2 2 2 2
pdata (x)
KL(pdata ||pg ) = Ex∼pdata (x) [log ]
pg (x)

Deep Learning – 12 / 39
OPTIMAL DISCRIMINATOR
∗ is:
For G fixed, the optimal discriminator DG

Credit: Mark Chang

The optimal discriminator returns a value greater than 0.5 if the


probability to come from the data (pdata (x )) is larger than the
probability to come from the generator (pg (x )).

Deep Learning – 13 / 39
OPTIMAL DISCRIMINATOR
∗ is:
For G fixed, the optimal discriminator DG

Credit: Mark Chang

Note: The optimal solution is almost never found in practice, since


the discriminator has a finite capacity and is trained on a finite
amount of data.
Therefore, the assumption needed to guarantee that the generator
minimizes the JSD usually does not hold in practice.
Deep Learning – 14 / 39
Challenges for GAN Optimization

Deep Learning – 15 / 39
ADVERSARIAL TRAINING

Deep Learning models (in general) involve a single player!


The player tries to maximize its reward (minimize its loss).
We use SGD (with backprop) to find the optimal parameters.
SGD has convergence guarantees (under certain conditions).
However, with non-convexity, we might only converge to a local minimum!

GANs instead involve two players


Discriminator is trying to maximize its reward,
Generator is trying to minimize discriminator’s reward.
SGD was not designed to find the Nash equilibrium of a game!
Therefore, we might not converge to the Nash equilibrium at all!

Deep Learning – 16 / 39
ADVERSARIAL TRAINING - EXAMPLE

Consider the function f (x , y ) = xy , where x and y are both


scalars.
Player A can control x and Player B can control y .
The loss:
Player A: LA (x , y ) = xy
Player B: LB (x , y ) = −xy
This can be rewritten as L(x, y) = \min_x \max_y \, xy
What we have here is a simple zero-sum game with its
characteristic minimax loss.
Deep Learning – 17 / 39
POSSIBLE BEHAVIOUR #1: CONVERGENCE

The partial derivatives of the losses are:

\frac{\partial L_A}{\partial x} = y, \qquad \frac{\partial L_B}{\partial y} = -x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.

Deep Learning – 18 / 39
POSSIBLE BEHAVIOUR #1: CONVERGENCE

In order for simultaneous gradient descent to converge to a fixed


point, both gradients have to be simultaneously 0.
They are both (simultaneously) zero only for the point (0,0).
This is a saddle point of the function f (x , y ) = xy .
The fixed point for a minimax game is typically a saddle point.
Such a fixed point is an example of a Nash equilibrium.
In adversarial training, convergence to a fixed point is not
guaranteed.

Deep Learning – 19 / 39
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR

Credit: Lilian Weng

Figure: A simulation of our example for updating x to minimize xy and updating y to


minimize -xy. The learning rate α = 0.1. With more iterations, the oscillation grows
more and more unstable.

Once x and y have different signs, every following gradient update


causes huge oscillations, and the instability gets worse over time, as
shown in the figure.

Deep Learning – 20 / 39
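The behaviour in the figure can be reproduced in a few lines. The sketch below simulates the simultaneous updates from the previous slide; α = 0.1 and the starting point are the only assumptions.

```python
# Simultaneous gradient descent on the zero-sum game f(x, y) = x * y.
alpha = 0.1
x, y = 1.0, 1.0                     # assumed starting point

for step in range(1, 61):
    grad_x = y                      # dL_A/dx for L_A(x, y) = x * y
    grad_y = -x                     # dL_B/dy for L_B(x, y) = -x * y
    x, y = x - alpha * grad_x, y - alpha * grad_y   # simultaneous update of both players
    if step % 10 == 0:
        radius = (x ** 2 + y ** 2) ** 0.5
        # The radius grows over time: the iterates spiral outwards instead of converging to (0, 0).
        print(step, round(x, 3), round(y, 3), round(radius, 3))
```

Each update multiplies the distance to the origin by sqrt(1 + α²), which is why the oscillation keeps growing.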
POSSIBLE BEHAVIOUR #3: CYCLES

Credit: Goodfellow

Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.

A discrete example: A never-ending game of Rock-Paper-Scissors


where player A chooses ’Rock’ → player B chooses ’Paper’ → A
chooses ’Scissors’ → B chooses ’Rock’ → ...
Takeaway: Adversarial training is highly unpredictable. It can get
stuck in cycles or become chaotic.

Deep Learning – 21 / 39
NON-STATIONARY LOSS SURFACE

From the perspective of one of the players, the loss surface


changes every time the other player makes a move.

This is in stark contrast to (full batch) gradient descent where the


loss surface is stationary no matter how many iterations of gradient
descent are performed.

Deep Learning – 22 / 39
ILLUSTRATION OF CONVERGENCE

Credit: Mark Chang

Deep Learning – 23 / 39
ILLUSTRATION OF CONVERGENCE: FINAL STEP

Credit: Mark Chang

Such convergence is not guaranteed, however.

Deep Learning – 24 / 39
CHALLENGES FOR GAN TRAINING

Non-convergence: the model parameters oscillate, destabilize and


never converge.
Mode collapse: the generator collapses and produces only a limited
variety of samples.
Diminished gradient: the discriminator becomes so successful that the
generator's gradient vanishes and it learns nothing.
Imbalance between the generator and the discriminator, causing
overfitting.
High sensitivity to the hyperparameter selection.

Deep Learning – 25 / 39
GAN variants

Deep Learning – 26 / 39
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

It was discovered that a relatively strong discriminator could


completely dominate the generator.
When optimizing the minimax loss, as the discriminator gets good
at identifying fake images, i.e. as D (G(z)) approaches 0, the
gradient with respect to the generator parameters vanishes.
Deep Learning – 27 / 39
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

Solution: Use a non-saturating generator loss instead:

J^{(G)} = -\frac{1}{2}\, \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]
In contrast to the minimax loss, when the discriminator gets good
at identifying fake images, the magnitude of the gradient of J (G)
increases and the generator is able to learn to produce better
images in successive iterations.
Deep Learning – 28 / 39
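The effect can be seen directly from the derivative of the two loss curves with respect to s = D(G(z)). The sketch below (illustrative, not from the lecture) compares the two gradient factors as s approaches 0.

```python
# Gradient factor w.r.t. s = D(G(z)) for the saturating and the non-saturating generator loss.
import numpy as np

s = np.array([0.5, 0.1, 0.01, 0.001])       # discriminator output on fake samples

grad_saturating     = -1.0 / (1.0 - s)      # d/ds log(1 - s): stays close to -1 as s -> 0 (flat loss curve)
grad_non_saturating = -1.0 / s              # d/ds (-log s): grows like -1/s, a strong signal when s is near 0

for si, gs, gn in zip(s, grad_saturating, grad_non_saturating):
    print(f"D(G(z)) = {si:6.3f}   saturating: {gs:8.3f}   non-saturating: {gn:10.1f}")
```

Combined with the saturating sigmoid inside D, the bounded factor of the minimax loss leaves almost no learning signal for the generator, whereas the growing factor of the non-saturating loss compensates for it.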
OTHER LOSS FUNCTIONS
Various losses for GAN training with different properties have been
proposed:

Source: Lucic et al. 2016

Deep Learning – 29 / 39
ARCHITECTURE-VARIANT GANS

Motivated by the different challenges in the GAN training procedure
described above, several architecture variants have been proposed.
Understanding and improving GAN training is a very active area of
research.

Credit: hindupuravinash
Deep Learning – 30 / 39
GAN APPLICATION
What kinds of problems can GANs address?
Generation
Conditional Generation
Clustering
Semi-supervised Learning
Representation Learning
Translation
Any traditional discriminative task can be approached with
generative models

Deep Learning – 31 / 39
CONDITIONAL GANS: MOTIVATION

In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.

Deep Learning – 32 / 39
CONDITIONAL GANS: ARCHITECTURE

Credit: Guim Perarnau

In a conditional GAN, additional information in the form of a vector y
is fed to both the generator and the discriminator.
The latent vector z can then encode all variation in the data that is not
already encoded by y.
E.g., y could encode the class of a hand-written number (from 0 to
9). Then, z could encode the style of the number (size, weight,
rotation, etc.).

Deep Learning – 33 / 39
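A minimal PyTorch sketch of this idea (an illustrative assumption, not the exact architecture from the figure): the label y is one-hot encoded and concatenated to the inputs of both networks.

```python
# Minimal conditional GAN building blocks: y is concatenated to the inputs of G and D.
import torch
import torch.nn as nn

latent_dim, n_classes, data_dim = 16, 10, 784   # illustrative sizes (e.g. flattened 28x28 digits)

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, data_dim), nn.Tanh(),
        )

    def forward(self, z, y_onehot):
        # Condition on y by concatenating it to the noise vector z.
        return self.net(torch.cat([z, y_onehot], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x, y_onehot):
        # The discriminator also sees y, so it judges "real sample of class y" vs. fake.
        return self.net(torch.cat([x, y_onehot], dim=1))

# Usage: draw five samples conditioned on class 3.
G = ConditionalGenerator()
z = torch.randn(5, latent_dim)
y = nn.functional.one_hot(torch.full((5,), 3, dtype=torch.long), n_classes).float()
fake = G(z, y)   # shape: (5, data_dim)
```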
CONDITIONAL GANS: EXAMPLE

Source: Mirza et al. 2014

Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note : z is implicit. It is not shown above.)

Deep Learning – 34 / 39
CONDITIONAL GANS: MORE EXAMPLES

Source: Isola et al. 2016

Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 35 / 39
MORE GENERATIVE MODELS

Today, we learned about two kinds of (directed) generative models:


Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs).
There are other interesting generative models, e.g.:
autoregressive models
restricted Boltzmann machines.
Note:
It is important to bear in mind that generative models are not
a solved problem.
There are many interesting hybrid models that combine two
or more of these approaches.

Deep Learning – 36 / 39
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html

Deep Learning – 37 / 39
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784

Deep Learning – 38 / 39
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them

Deep Learning – 39 / 39
Deep Learning

GAN variants

Learning goals
non-saturating loss
conditional GANs
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

It was discovered that a relatively strong discriminator could


completely dominate the generator.
When optimizing the minimax loss, as the discriminator gets good
at identifying fake images, i.e. as D (G(z)) approaches 0, the
gradient with respect to the generator parameters vanishes.
Deep Learning – 1 / 12
NON-SATURATING LOSS

Credit: Daniel Seita

Figure: Various generator loss functions (J (G) ).

Solution: Use a non-saturating generator loss instead:

J^{(G)} = -\frac{1}{2}\, \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]
In contrast to the minimax loss, when the discriminator gets good
at identifying fake images, the magnitude of the gradient of J (G)
increases and the generator is able to learn to produce better
images in successive iterations.
Deep Learning – 2 / 12
OTHER LOSS FUNCTIONS
Various losses for GAN training with different properties have been
proposed:

Source: Lucic et al. 2016

Deep Learning – 3 / 12
ARCHITECTURE-VARIANT GANS
Motivated by the different challenges in the GAN training procedure
described above, several architecture variants have been proposed.
Understanding and improving GAN training is a very active area of
research.

Credit: hindupuravinash

Deep Learning – 4 / 12
CONDITIONAL GANS: MOTIVATION

In an ordinary GAN, the only input that is fed to the generator is
the latent vector z.
A conditional GAN allows you to condition the generative model on
additional variables.
E.g. a generator conditioned on text input (in addition to z) can be
trained to generate the image described by the text.

Deep Learning – 5 / 12
CONDITIONAL GANS: ARCHITECTURE

Credit: Guim Perarnau

In a conditional GAN, additional information in the form of a vector y
is fed to both the generator and the discriminator.
The latent vector z can then encode all variation in the data that is not
already encoded by y.
E.g. y could encode the class of a hand-written number (from 0 to
9). Then, z could encode the style of the number (size, weight,
rotation, etc).

Deep Learning – 6 / 12
CONDITIONAL GANS: EXAMPLE

Source: Mirza et al. 2014

Figure: When the model is conditioned on a one-hot coded class label, it generates
random images that belong (mostly) to that particular class. The randomness here
comes from the randomly sampled z. (Note : z is implicit. It is not shown above.)

Deep Learning – 7 / 12
CONDITIONAL GANS: MORE EXAMPLES

Source: Isola et al. 2016

Figure: Conditional GANs can translate images of one type to another. In each of the
4 examples above, the image on the left is fed to the network and the image on the
right is generated by the network.
Deep Learning – 8 / 12
MORE GENERATIVE MODELS

Today, we learned about one kind of (directed) generative model:
Generative Adversarial Networks (GANs).
There are other interesting generative models, e.g.:
autoregressive models
restricted Boltzmann machines.
Note:
It is important to bear in mind that generative models are not
a solved problem.
There are many interesting hybrid models that combine two
or more of these approaches.

Deep Learning – 9 / 12
REFERENCES
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
Generative Adversarial Networks
https://arxiv.org/abs/1406.2661
Santiago Pascual, Antonio Bonafonte, Joan Serra (2017)
SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html

Deep Learning – 10 / 12
REFERENCES
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks
Lucas Theis, Aaron van den Oord, Matthias Bethge (2016)
A note on the evaluation of generative models
https://arxiv.org/abs/1511.01844
Aiden Nibali (2016)
The GAN objective, from practice to theory and back again
https://aiden.nibali.org/blog/2016-12-21-gan-objective/
Mehdi Mirza, Simon Osindero (2014)
Conditional Generative Adversarial Nets
https://arxiv.org/abs/1411.1784

Deep Learning – 11 / 12
REFERENCES
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros (2016)
Image-to-Image Translation with Conditional Adversarial Networks
https://arxiv.org/abs/1611.07004
Guim Perarnau (2017)
Fantastic GANs and where to find them
https://guimperarnau.com/blog/2017/03/Fantastic-GANs-and-where-to-find-them

Deep Learning – 12 / 12
Deep Learning

Challenges for GAN Optimization

Learning goals
(no) convergence to a fixed point
problems of the adversarial setting
ADVERSARIAL TRAINING

Deep Learning models (in general) involve a single player!


The player tries to maximize its reward (minimize its loss).
We use SGD (with backprop) to find the optimal parameters.
SGD has convergence guarantees (under certain conditions).
However, with non-convexity, we might only converge to a local minimum!

GANs instead involve two players


Discriminator is trying to maximize its reward.
Generator is trying to minimize discriminator’s reward.
SGD was not designed to find the Nash equilibrium of a game!
Therefore, we might not converge to the Nash equilibrium at all!

Deep Learning – 1 / 11
ADVERSARIAL TRAINING - EXAMPLE

Consider the function f (x , y ) = xy , where x and y are both


scalars.
Player A can control x and Player B can control y .
The loss:
Player A: LA (x , y ) = xy
Player B: LB (x , y ) = −xy
This can be rewritten as L(x, y) = \min_x \max_y \, xy
What we have here is a simple zero-sum game with its
characteristic minimax loss.
Deep Learning – 2 / 11
POSSIBLE BEHAVIOUR #1: CONVERGENCE

The partial derivatives of the losses are:

\frac{\partial L_A}{\partial x} = y, \qquad \frac{\partial L_B}{\partial y} = -x
In adversarial training, both players perform gradient descent on
their respective losses.
We update x with x − α · y and y with y + α · x simultaneously in
one iteration, where α is the learning rate.

Deep Learning – 3 / 11
POSSIBLE BEHAVIOUR #1: CONVERGENCE

In order for simultaneous gradient descent to converge to a fixed


point, both gradients have to be simultaneously 0.
They are both (simultaneously) zero only for the point (0,0).
This is a saddle point of the function f (x , y ) = xy .
The fixed point for a minimax game is typically a saddle point.
Such a fixed point is an example of a Nash equilibrium.
In adversarial training, convergence to a fixed point is not
guaranteed.

Deep Learning – 4 / 11
POSSIBLE BEHAVIOUR #2: CHAOTIC BEHAVIOUR

Credit: Lilian Weng

Figure: A simulation of our example for updating x to minimize xy and updating y to


minimize -xy. The learning rate α = 0.1. With more iterations, the oscillation grows
more and more unstable.

Once x and y have different signs, every following gradient update


causes huge oscillations, and the instability gets worse over time, as
shown in the figure.

Deep Learning – 5 / 11
POSSIBLE BEHAVIOUR #3: CYCLES

Credit: Goodfellow

Figure: Simultaneous gradient descent with an infinitesimal step size can result in a
circular orbit in the parameter space.

A discrete example: A never-ending game of Rock-Paper-Scissors


where player A chooses ’Rock’ → player B chooses ’Paper’ → A
chooses ’Scissors’ → B chooses ’Rock’ → ...
Takeaway: Adversarial training is highly unpredictable. It can get
stuck in cycles or become chaotic.

Deep Learning – 6 / 11
NON-STATIONARY LOSS SURFACE

From the perspective of one of the players, the loss surface


changes every time the other player makes a move.

This is in stark contrast to (full batch) gradient descent where the


loss surface is stationary no matter how many iterations of gradient
descent are performed.

Deep Learning – 7 / 11
ILLUSTRATION OF CONVERGENCE

Credit: Mark Chang

Deep Learning – 8 / 11
ILLUSTRATION OF CONVERGENCE: FINAL STEP

Credit: Mark Chang

Such convergence is not guaranteed, however.

Deep Learning – 9 / 11
CHALLENGES FOR GAN TRAINING

Non-convergence: the model parameters oscillate, destabilize and


never converge.
Mode collapse: the generator collapses and produces only a limited
variety of samples.
Diminished gradient: the discriminator becomes so successful that the
generator's gradient vanishes and it learns nothing.
Imbalance between the generator and the discriminator, causing
overfitting.
High sensitivity to the hyperparameter selection.

Deep Learning – 10 / 11
REFERENCES
Ian Goodfellow (2016)
NIPS 2016 Tutorial: Generative Adversarial Networks
https://arxiv.org/abs/1701.00160
Lilian Weng (2017)
From GAN to WGAN
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Mark Chang (2016)
Generative Adversarial Networks
https://www.slideshare.net/ckmarkohchang/generative-adversarial-networks

Deep Learning – 11 / 11
