
Deep Learning for Computer Vision

Lecture 3
Feed Forward Networks and Back-propagation

Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg

10.11.2023



Recapitulation

In the last lecture, the topic was machine learning basics. In detail, we
discussed:
• Machine learning definition.
• Capacity, under-fitting, over-fitting.
• Regularization.
• Maximum Likelihood Estimation.
• Gradient-based optimization.
• Classification.
• Regression.



Feedforward Neural Network
• A feedforward neural network is a composition of operations that
can be represented by nodes of a directed graph. The neural
network approximates the mapping function f, which has a
(learnable) parameter set.
• In our context, the neural network is often the mapping function
f : X → Y between the input X , e.g. image, and output Y space,
e.g. object category.
• Learning the mapping, where f is parametrized by w, requires
training data, similar to other machine learning algorithms. Given
the input x ∈ X , the network makes the prediction y = f (x; w).
Composition of functions
It is a composition of differentiable functions. For instance, a 4-layer network:
• f1 = σ(w1 x), where σ(·) is a non-linear activation.
• f2 = pool(f1 ), where pool sub-samples the feature space.
• f3 = σ(w2 f2 ), where σ(·) is a non-linear activation.
• f4 = softmax(w3 f3 ), where we make a class prediction.
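As a rough sketch of such a composition (ours, not from the slides; the input size of 8, the pooling over pairs of features and the 3 output classes are assumptions for illustration):

import torch
import torch.nn.functional as F

# hypothetical layer sizes, only for illustration
w1 = torch.randn(16, 8)   # first linear layer
w2 = torch.randn(16, 8)   # second linear layer (pooling halves the features)
w3 = torch.randn(3, 16)   # classification layer

x = torch.randn(8)

f1 = torch.sigmoid(w1 @ x)                         # f1 = sigma(w1 x)
f2 = F.max_pool1d(f1.view(1, 1, -1), 2).view(-1)   # f2 = pool(f1), sub-samples the features
f3 = torch.sigmoid(w2 @ f2)                        # f3 = sigma(w2 f2)
f4 = F.softmax(w3 @ f3, dim=0)                     # f4 = softmax(w3 f3), class prediction
print(f4)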



Feedforward Neural Network
• The operations inside a neural network can be linear or non-linear.
• It is important that they are differentiable, though.
Learning the parameters of the network
Training a neural network refers to minimizing a loss function L (gradient-based
optimization). The optimization with ∇w L involves:
• Back-propagation (a method to compute the gradients of the loss w.r.t.
each variable / parameter).
• Gradient descent (the rule for iteratively applying the gradients to update
the value of the variables / parameters).

• Note that learning the parameters can also be accomplished with
evolutionary optimization, Bayesian optimization or even random search.
• In the past, the visual representation of a neural network has mostly been
described by artificial neurons or layers.



Artificial Neuron (Composition of Operations)
• All inputs are multiplied with the neuron weights w and summed up
(similar to the convolution operation).
• The bias term b can also be included: f = σ(wx + b).
• The summation output goes through a non-linear activation
(e.g. sigmoid σ(·)).
• Hundreds or thousands of neurons communicate with each other through
their activations.
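A minimal sketch of a single neuron in PyTorch (the weight, bias and input values are arbitrary examples):

import torch

x = torch.tensor([1.0, -2.0])   # inputs x1, x2
w = torch.tensor([0.5, 0.3])    # neuron weights
b = torch.tensor(0.1)           # bias term

f = torch.sigmoid(torch.dot(w, x) + b)   # f = sigma(w x + b)
print(f)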



Multilayer Perceptron (MLP)
• The standard feedforward network is the multilayer perceptron (MLP).
This is our reference model.
• It is a popular architecture, e.g. PointNet [1].
• It is traditionally a shallow neural network (input, hidden and output layers).
(Figure: an MLP with an input layer of four inputs, one hidden layer and an
output layer with a single output.)
? Why is it challenging to add hundreds of hidden layers?
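A minimal sketch of such a shallow MLP in PyTorch (the hidden width of 8 and the sigmoid activation are assumptions for illustration):

import torch
import torch.nn as nn

# four inputs, one hidden layer, a single output
mlp = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> hidden layer
    nn.Sigmoid(),      # non-linear activation
    nn.Linear(8, 1),   # hidden layer -> output layer
)

x = torch.randn(4)
print(mlp(x))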


Neural Network as Computational Graph
Drawing neural networks as structured layers is helpful for visualization.
When it comes to computing the network parameters (also known as
weights), we build the computational graph.
• A computational graph is a directed graph.
• The nodes correspond to operations.
• The edges correspond to the data flow.
• The computational graph can represent functions such as
f(x, y, z) = (x − y)z. It is drawn with x and y feeding a subtraction node
g = x − y, whose output, together with z, feeds a multiplication node f = g × z.
• Note that instead of writing the variable names on the edges (data
flow), we could use nodes to represent variables too.
• We represent neural networks as computational graphs with
forward and backward passes.


Forward Pass

It refers to evaluating the function for a set of inputs, where the function
describes the neural network. It is the process of making a prediction with a
neural network.
(Graph: x and y feed the node g = x − y; g and z feed the node f = g × z.)



Backward Pass
Most of the graph variables are learned using gradient-based
optimization. The backward pass calculates the gradients of all
variables. It takes place after the forward pass.
(Graph: as before, x and y feed g = x − y, and g and z feed f = g × z.)

• This is the crucial step for training the neural network.


? On the above graph, consider x and y as input terms, g as an
intermediate term, z as a model term and f as the output term. Which
terms will not be updated?
• Expressing the neural network as a computational graph allows us
to model complex network compositions / functions, where the
gradient computation can be automated.



Back-propagation Example



Computational Graph

• Consider the function f(x, y, z) = (x − y)z, represented by a
computational graph with a subtraction node g and a multiplication node f.
• x = 5, y = 2 and z = −4 are the inputs to the graph.



Computational Graph (Forward Pass)

• Consider the function f(x, y, z) = (x − y)z, represented by a
computational graph.
• x = 5, y = 2 and z = −4 are the inputs to the graph.
• The forward pass yields g = x − y = 3 and f = g × z = −12.
• The output is −12.



Computational Graph (Backward Pass)
• f(x, y, z) = (x − y)z for x = 5, y = 2, z = −4.
• g = x − y = 3
• f = g × z = −12

Derivatives of the graph

• Goal? Find how the input should influence the output.
• How? Partial derivatives of the output (function) w.r.t. the input.
• We need to compute ∂f/∂x, ∂f/∂y, ∂f/∂z.
• How? The chain rule.
• ∂f/∂f = 1.
• ∂f/∂g = z, ∂f/∂z = g, where f = g × z and g = x − y = 3.
• ∂g/∂x = 1, ∂g/∂y = −1, where g = x − y.
• ∂f/∂x = (∂f/∂g)(∂g/∂x) = z · 1 = −4 · 1 = −4, ∂f/∂y = (∂f/∂g)(∂g/∂y) = 4, ∂f/∂z = 3.



Computational Graph (Backward Pass)
• Apply the chain rule from the end to the start.
• Every node has input and output variables. We need the gradient
for each of them.
• ∂f/∂x = (∂f/∂g)(∂g/∂x): during the gradient computation there is the
upstream gradient ∂f/∂g and the local gradient ∂g/∂x.
• Similarly: ∂f/∂y = (∂f/∂g)(∂g/∂y), ∂f/∂z = (∂f/∂f)(∂f/∂z), ∂f/∂g = (∂f/∂f)(∂f/∂g).
(Graph with the computed gradients: ∂f/∂x = −4, ∂f/∂y = 4, ∂f/∂z = 3, ∂f/∂g = −4.)
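The chain-rule computation above can also be sketched directly in plain Python, mirroring the numbers on the slide:

# forward pass
x, y, z = 5.0, 2.0, -4.0
g = x - y             # g = 3
f = g * z             # f = -12

# backward pass (chain rule, from the end to the start)
df_df = 1.0
df_dg = df_df * z     # upstream * local = -4
df_dz = df_df * g     # 3
df_dx = df_dg * 1.0   # dg/dx = 1  -> -4
df_dy = df_dg * -1.0  # dg/dy = -1 ->  4

print(df_dx, df_dy, df_dz)   # -4.0 4.0 3.0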



Forward and Backward Pass
These two steps are the main ingredients for training a neural network.



Computational Graph (Gradient Flow)

During the forward pass, the operation f takes as input x and y to produce z.
(Graph: x and y enter the node f, which outputs z.)



Computational Graph (Gradient Flow)
During the backward pass, the local gradients ∂z/∂x and ∂z/∂y are
computed. Note that x and y, as well as z, are related through the
operation f.
(Graph: the node f with inputs x, y and output z, annotated with the local
gradients ∂z/∂x and ∂z/∂y.)



Computational Graph (Gradient Flow)
At the beginning of the backward pass, the upstream gradient is w.r.t.
the loss L, i.e. the output of the last graph operation. For a neural network,
the loss is usually the last operation; it comes after the prediction.
(Graph: the upstream gradient ∂L/∂z reaches the node f and is combined
with the local gradients: ∂L/∂x = (∂L/∂z)(∂z/∂x), ∂L/∂y = (∂L/∂z)(∂z/∂y).)



Back-propagation

A few observations:
• It is effectively the chain rule.
• We back-propagate to calculate the gradients of the network
parameters.
? How do we make use of the gradients to update the network
parameters?
• The parameters are updated with gradient descent.
• It is a repetitive process that can be efficiently implemented to
avoid computational overhead [2].
? Does back-propagation apply to unsupervised learning too?



Back-propagation Linear Regression Example



Linear Regression Backprop

(Graph: x and w0 enter a × node; its output and w1 enter a + node; the result
enters a − node and then a (·)² node that produces the loss.)



Linear Regression Backprop
(Graph with operations: op1 = x·w0, op2 = op1 + w1, op3 = y − op2, L = op3².)

Derivatives of the graph

• Mapping: f = w0·x + w1, where the initial values of the parameters are
w0 = 0.1, w1 = 0.1 and x = 2, y = 1.5. x is the data and y the label.
• Goal: find the best parameter values and thus ∂L/∂w0 and ∂L/∂w1.
• We go from back to front to compute all derivatives. We treat each node
individually with local and upstream derivatives.



Linear Regression Backprop
(Forward pass: x = 2, w0 = 0.1, w1 = 0.1, y = 1.5; op1 = x·w0 = 0.2,
op2 = op1 + w1 = 0.3, op3 = y − op2 = 1.2, L = op3² = 1.44.)

Derivatives of the graph

• Node (·)², where the node operation is op3²: ∂L/∂op3 = 2·op3 = 2.4 (local gradient)
and ∂L/∂L = 1 (upstream). Thus ∂L/∂op3 = (∂L/∂L)(2·op3) = 2.4.
• Node −, where the node operation is op3 = y − op2: ∂op3/∂op2 = −1, ∂op3/∂y = 1
(local gradients). The upstream gradient is ∂L/∂op3 = 2.4 and thus
∂L/∂op2 = (∂L/∂op3)(∂op3/∂op2) = −2.4, ∂L/∂y = (∂L/∂op3)(∂op3/∂y) = 2.4.



Linear Regression Backprop
(Forward pass values as before: op1 = 0.2, op2 = 0.3, op3 = 1.2, L = 1.44.)

Derivatives of the graph

• Node +, where the node operation is op2 = op1 + w1: ∂op2/∂op1 = 1, ∂op2/∂w1 = 1
(local gradients); ∂L/∂op2 = −2.4 (upstream gradient). Thus
∂L/∂op1 = (∂L/∂op2)(∂op2/∂op1) = −2.4, ∂L/∂w1 = (∂L/∂op2)(∂op2/∂w1) = −2.4.
• Node ×, where the node operation is op1 = x·w0: ∂op1/∂x = w0 = 0.1,
∂op1/∂w0 = x = 2 (local gradients); ∂L/∂op1 = −2.4 (upstream gradient). Finally
∂L/∂x = (∂L/∂op1)(∂op1/∂x) = −0.24, ∂L/∂w0 = (∂L/∂op1)(∂op1/∂w0) = −4.8.



Back-propagation → Gradient Descent

• After computing all gradients, the update step takes place for the
learnable parameters.
? Which parameters, in the above example, should be updated?
• The optimization algorithm is (mini-batch) gradient descent.
• The parameter update is given by:

w0 = w0 − η ∂L/∂w0,   (1)
w1 = w1 − η ∂L/∂w1.   (2)

• The process of forward pass, backward pass and parameter update is
iterated until convergence.



Linear Regression Backprop - Code
# data and label
x = 2
y = 1.5

# initial parameter values
w0 = w1 = 0.1

for _ in range(100):
    # forward pass (operations)
    op1 = x * w0
    op2 = op1 + w1
    op3 = y - op2
    L = op3 ** 2

    # exponent node: op3 ** 2
    # local gradient(s)
    dL_dop3 = 2 * op3
    # upstream gradient
    dL_dL = 1
    # output gradient(s)
    dL_dop3 = dL_dL * dL_dop3

    # subtraction node: y - op2
    # local gradient(s)
    dop3_dop2 = -1
    dop3_dy = 1
    # output gradient(s)
    dL_dop2 = dL_dop3 * dop3_dop2
    dL_dy = dL_dop3 * dop3_dy

    # addition node: op1 + w1
    # local gradient(s)
    dop2_dop1 = 1
    dop2_dw1 = 1
    # output gradient(s)
    dL_dop1 = dL_dop2 * dop2_dop1
    dL_dw1 = dL_dop2 * dop2_dw1

    # multiplication node: x * w0
    # local gradient(s)
    dop1_dx = w0
    dop1_dw0 = x
    # output gradient(s)
    dL_dx = dL_dop1 * dop1_dx
    dL_dw0 = dL_dop1 * dop1_dw0

    # update parameters (lr: learning rate)
    lr = 0.01
    w0 = w0 - lr * dL_dw0
    w1 = w1 - lr * dL_dw1
    f = w0 * x + w1

print('Loss: {:.3} and parameters {:.3},{:.3}'.format(L, w0, w1))


Linear Regression Backprop - PyTorch Code

import torch

x = torch.tensor([2.0])
y = torch.tensor([1.5])

w0 = torch.tensor([0.1], requires_grad=True)
w1 = torch.tensor([0.1], requires_grad=True)

for _ in range(100):
    f = (x * w0) + w1
    loss = (y - f).pow(2).sum()
    loss.backward()

    lr = 1e-2
    with torch.no_grad():
        w0 -= lr * w0.grad
        w1 -= lr * w1.grad

    w0.grad.zero_()
    w1.grad.zero_()
    print('Loss: {:.3} and parameters {:.3},{:.3}'.format(loss.item(), w0.item(), w1.item()))



PyTorch Implementation

? What is missing from the PyTorch implementation, compared to the
plain Python code above?
• The gradients can be computed automatically with automatic
differentiation (autodiff / autograd) [3].
• Every computer program consists of a sequence of elementary
arithmetic operations and elementary functions.
• Repeatedly applying the chain rule yields the gradients.
• Most modern DL frameworks, such as PyTorch, work with autodiff.
• Programming new operations (layers) for neural networks involves
defining the forward and backward step. With autodiff, only the
forward step is necessary. Note that without autodiff, it would be
important to compute numerical gradients to check the backward
step's gradients, as sketched below.
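As a sketch of such a numerical check (ours, using central differences on the linear-regression loss from the previous slides):

import torch

x = torch.tensor([2.0])
y = torch.tensor([1.5])
w0 = torch.tensor([0.1], requires_grad=True)
w1 = torch.tensor([0.1], requires_grad=True)

loss = (y - (x * w0 + w1)).pow(2).sum()
loss.backward()   # autodiff gradient

def loss_at(a, b):
    # loss evaluated at given parameter values
    return float((y - (x * a + b)).pow(2).sum())

eps = 1e-4   # central-difference approximation of dL/dw0
num_grad_w0 = (loss_at(w0.item() + eps, w1.item()) -
               loss_at(w0.item() - eps, w1.item())) / (2 * eps)
print(w0.grad.item(), num_grad_w0)   # the two values should closely agree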



Linear Regression Backprop - Observations

• A simple case of a neural network.
• Deep neural networks use the same training principles at large scale.
• It is important to efficiently program the forward and backward
operations to generalize to millions of parameters and training samples.
• In the general case, the input has vector form.
• The Jacobian matrix is then computed for all variables of an operation.
• The gradient descent algorithm is applied in the same way to vectors.



Jacobian Matrix

For example, consider the function f : R^4 → R^3 with input x and output
z. The derivative of each output element w.r.t. each input element is the
Jacobian, given by:

J = [ ∂z1/∂x1  ∂z1/∂x2  ∂z1/∂x3  ∂z1/∂x4
      ∂z2/∂x1  ∂z2/∂x2  ∂z2/∂x3  ∂z2/∂x4
      ∂z3/∂x1  ∂z3/∂x2  ∂z3/∂x3  ∂z3/∂x4 ]   (3)
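As a sketch, such a Jacobian can be obtained automatically in PyTorch; the mapping below (a random linear map followed by tanh) is only an assumed example of an f : R^4 → R^3:

import torch
from torch.autograd.functional import jacobian

A = torch.randn(3, 4)   # assumed parameters of the example mapping

def f(x):
    return torch.tanh(A @ x)   # f: R^4 -> R^3

x = torch.randn(4)
J = jacobian(f, x)
print(J.shape)   # torch.Size([3, 4]), matching Eq. (3)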



Back-propagation and Vectorization



Computational Graph (Vectorized Operations)
The main difference with vector inputs / outputs is in the backward
step. The Jacobian matrix now has to be computed for one or more
pairs of input and output.

(Graph: node f with inputs x, y and output z; the gradients combine as
∂L/∂x = (∂L/∂z)(∂z/∂x) and ∂L/∂y = (∂L/∂z)(∂z/∂y), with upstream gradient ∂L/∂z.)
• For example, if x and z are vectors, then ∂z/∂x corresponds to the Jacobian.
• Similarly, we compute the Jacobian for ∂z/∂y.
? Do we compute the Jacobian for the loss L as well?



Back-propagation, Vectorized Example



Vectorized Example

Consider the function f(x, W) = W × x, where x ∈ R^{n×n}, W ∈ R^{n×n} and
× denotes the Hadamard (elementwise) product. Represented as a computational
graph, the operation is followed by the loss function
L = | Σ_{i=1}^{n×n} f(x, W)_i − y |.

(Graph: W and x enter a × node producing X0; a Σ node produces X1; a − node
with y produces X2; a |·| node produces the loss L.)



Vectorized Example (Forward, Backward and Parameter Update)
The Jacobians or derivatives are computed during back-propagation.

(Forward pass: W = [[0.1, 0.2], [0.3, 0.2]], x = [[2.0, 3.0], [4.0, 8.0]], y = 10;
X0 = W × x = [[0.2, 0.6], [1.2, 1.6]], X1 = Σ X0 = 3.6, X2 = X1 − y = −6.4, L = |X2| = 6.4.)

Back-propagation

• Node |·|, where the node operation is L = |X2|: ∂L/∂X2 = X2/|X2| = −1 (local
gradient) and ∂L/∂L = 1 (upstream gradient).
• Node −, where the node operation is X2 = X1 − y: ∂X2/∂X1 = 1 (local gradient)
and ∂L/∂X2 = −1 (upstream gradient). Thus ∂L/∂X1 = (∂L/∂X2)(∂X2/∂X1) = −1.
• Node Σ, where the operation is X1 = X0(0,0) + X0(0,1) + X0(1,0) + X0(1,1): the
Jacobian of the local gradients is
∂X1/∂X0 = [[∂X1/∂X0(0,0), ∂X1/∂X0(0,1)], [∂X1/∂X0(1,0), ∂X1/∂X0(1,1)]] = [[1, 1], [1, 1]],
and ∂L/∂X1 = −1 (upstream). Thus ∂L/∂X0 = (∂L/∂X1)(∂X1/∂X0) = [[−1, −1], [−1, −1]].



Vectorized Example (Forward, Backward and Parameter Update)
We skip the gradient of x and focus only on the gradients of W, because these
are the parameters to be learned.

(Forward pass as before: W = [[0.1, 0.2], [0.3, 0.2]], x = [[2.0, 3.0], [4.0, 8.0]],
y = 10; X0 = [[0.2, 0.6], [1.2, 1.6]], X1 = 3.6, X2 = −6.4, L = 6.4.)

Back-propagation

• Node ×, where the operation is X0 = W × x (elementwise multiplication): for the
local gradient ∂X0/∂W, the Jacobian matrix is
[[∂X0/∂W(0,0), ∂X0/∂W(0,1)], [∂X0/∂W(1,0), ∂X0/∂W(1,1)]] = [[2, 3], [4, 8]]. The
upstream gradient ∂L/∂X0 is known, and thus ∂L/∂W = (∂L/∂X0)(∂X0/∂W) =
[[−2, −3], [−4, −8]]. Note the elementwise multiplication.
• Finally, we update the parameters with gradient descent: W = W − η ∂L/∂W.
• The forward step, backward step and parameter update are applied iteratively
until convergence.



Vectorized Examples (PyTorch)

import torch

W = torch.tensor([[0.1, 0.2], [0.3, 0.2]], requires_grad=True)
x = torch.tensor([[2.0, 3.0], [4.0, 8.0]])
y = 10

for _ in range(1000):
    X = []
    X.append(torch.mul(W, x))
    X.append(torch.sum(X[-1]))
    X.append(X[-1] - y)

    # store the gradients of the intermediate results
    X[0].retain_grad()
    X[1].retain_grad()
    X[2].retain_grad()

    loss = torch.abs(X[-1])
    loss.backward()

    lr = 1e-4
    with torch.no_grad():
        W -= lr * W.grad
        W.grad.zero_()
    print(loss.item())



Vectorized Example (Observations)

• Back-propagation and gradient descent work in the same way with
vectors and scalars.
• When working with vectors, the gradients of a variable should
have the same dimensions as the variable, as the check below illustrates.
• The operations can also be described as layers or modules.
• It is a requirement for a module / layer to be differentiable.
• A few popular modules are linear, ReLU, sigmoid, tanh, softmax,
the L2-loss and the cross-entropy loss.
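A quick sketch of this property (the tensor shape is an arbitrary example):

import torch

W = torch.randn(3, 4, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()
print(W.grad.shape == W.shape)   # True: the gradient matches the variable's dimensions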



Neural Network Modules



Neural Network Modules

(Graph: inputs x and y enter the node f, which outputs z.)

• f → a differentiable operation.
• x, y → inputs that correspond to scalars, vectors, matrices or tensors.
• z → output represented by a scalar, vector, matrix or tensor.



Identity

• Function: f (x) = x.
• Range: (−∞, ∞).
• Derivative: f 0 (x) = 1.
• The derivative is continuous
and there are no vanishing
gradients.
• It is a common activation for
regression problems.
? Name applications that require
the identity module as output.
• One usually finds it as the last
layer / module of a neural
network.



Rectified Linear Unit (ReLU) [4]
• Function: f(x) = x if x ≥ 0, 0 if x < 0.
• Range: [0, ∞).
• Derivative: f′(x) = 1 if x ≥ 0, 0 if x < 0.
• Derivative range: {0, 1}.
• It is the most common activation function in the deep learning era.
• The derivative is not continuous at 0.
• It causes problems for non-positive inputs.
? What is the motivation behind ReLU?


Leaky Rectified Linear Unit (ReLU) [5]
• Function: f(x) = x if x ≥ 0, 0.5x if x < 0.
• Range: (−∞, ∞).
• Derivative: f′(x) = 1 if x ≥ 0, 0.5 if x < 0.
• Derivative range: {0.5, 1}.
• The main difference from ReLU is that it allows negative inputs to pass
with a small slope.
• The derivative is not continuous at 0.
• The common coefficient is 0.01 (0.5 is used here only for illustration).
? Is Leaky ReLU a good activation for regression?



Exponential Linear Unit (ELU) [6]
• Function: f(x) = x if x ≥ 0, α(e^x − 1) if x < 0.
• Range: [−α, ∞) for α ≥ 0 and [0, ∞) for α < 0.
• Derivative: f′(x) = 1 if x ≥ 0, αe^x if x < 0.
• Derivative range: [0, α) ∪ {1}.
• It usually converges faster than ReLU and delivers better performance.
• It is smoother than ReLU.
? Is ELU a good activation for regression?



Logistic Sigmoid
• Function: f(x) = 1 / (1 + e^(−x)).
• Range: [0, 1].
• Derivative: f′(x) = e^(−x) / (1 + e^(−x))².
• Derivative range: (0, 0.25].
• The derivative is continuous, but there are vanishing gradients and
saturation.
• It is known from logistic regression.
• It is used as an output activation for classification problems.
? What is the advantage of bounding the range of the function?



Hyperbolic Tangent (tanh)

• Function: f(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
• Range: (−1, 1).
• Derivative: f′(x) = 1 − f(x)².
• Derivative range: (0, 1].
• The derivative is continuous, but there are vanishing gradients and
saturation.
• It is used as an output activation for classification problems.
? What is the main difference from the sigmoid?
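As a compact sketch (ours, in NumPy, using the coefficients stated on these slides), the activations above and their derivatives could be implemented as follows:

import numpy as np

def relu(x):
    return np.where(x >= 0, x, 0.0)

def relu_grad(x):
    return np.where(x >= 0, 1.0, 0.0)

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid_grad(x), tanh_grad(x))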



Softmax
• The standard activation for multi-class problems. For a K-class
prediction, there are K activations (possible outputs).
• Function: f(x, i) = e^(x_i) / Σ_{j=1..K} e^(x_j). Note that Σ_{j=1..K} f(x, j) = 1.
• Range: [0, 1].
• The input and output of the softmax are represented by vectors (of
K elements). As a result, we need to compute the Jacobian during
the backward step.
• Jacobian:
  [ ∂f_1/∂x_1 · · · ∂f_1/∂x_K
    ∂f_2/∂x_1 · · · ∂f_2/∂x_K
    · · ·
    ∂f_K/∂x_1 · · · ∂f_K/∂x_K ]
• Derivative rule: ∂f_i/∂x_j = f(x, i)(1 − f(x, j)) if i = j, and −f(x, i)f(x, j) if i ≠ j.
? This is an expensive operation. Do we need to compute the
complete Jacobian matrix?
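A small NumPy sketch of the softmax and its full Jacobian (ours; the max-shift anticipates the numerical-stability trick discussed next):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / np.sum(e)

def softmax_jacobian(x):
    s = softmax(x)
    # J[i, j] = s_i * (delta_ij - s_j), i.e. the derivative rule above
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))
print(softmax_jacobian(x))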



Softmax (Numerical Stability)
• To avoid having very small and very large numbers as arguments to the
exponential, we can normalize the inputs.
• Common errors would be RuntimeWarning: overflow encountered in exp
(in expo = np.exp(x)) or RuntimeWarning: invalid value encountered in
true_divide (in return expo / np.sum(expo)).
• We introduce a constant C in the softmax:
f(x, i) = e^(x_i) / Σ_j e^(x_j) = C·e^(x_i) / Σ_j C·e^(x_j).
• Next we bring the constant into the exponent:
C·e^(x_i) / Σ_j C·e^(x_j) = e^(ln C)·e^(x_i) / Σ_j e^(ln C)·e^(x_j)
= e^(x_i + ln C) / Σ_j e^(x_j + ln C).
• We can replace ln C with D and treat it as an arbitrary constant:
e^(x_i + ln C) / Σ_j e^(x_j + ln C) = e^(x_i + D) / Σ_j e^(x_j + D).
• Finally, we choose D = −max(x), which shifts the inputs closer to zero.
All values other than the maximum become negative. As a result,
very large exponents are shifted towards zero.
? What could happen if we don't shift the values to be negative?



Softmax (Code)

import numpy as np

x = []
x.append([1, 2, 3])
x.append([1e-20, 2e-20, 3e-20])
x.append([1000, 2000, 3000])


def softmax(x):
    expo = np.exp(x)
    return expo / np.sum(expo)


def softmax_stable(x):
    c = x - np.max(x)
    expo = np.exp(c)
    return expo / np.sum(expo)


for v in x:
    print(softmax(v))
    print(softmax_stable(v))
    print('\n')



Softmax (Observations)

• We find the softmax mostly as the last module of a network. It has
recently been used as an attention mechanism as well [7].
• The input to the softmax is called the logits and expresses
unnormalized probabilities.
• The output of the softmax is a probability distribution.



Cross Entropy
• Cross entropy measures the distance between two probability
distributions (in our context).
• One probability distribution corresponds to the softmax output of
the neural network and the other to the ground truth.
• The ground truth is the one-hot vector, defined as a K-element
vector: all of its elements are zero other than the correct class,
which is set to one.
• The cross entropy is given by ce(p, q) = −Σ_k p(k) log(q(k)),
where p and q are the compared probability distributions.
• In our problem, we have K classes. We define the ground-truth
distribution as y and the softmax output as p. The cross entropy is
re-written as ce(y, p) = −Σ_{k=1..K} y(k) log(p(k)), where p(k) is the
probability predicted by the model and y(k) the ground-truth
probability.
• y(k) = 1 only for the ground-truth class, and the cross entropy can
then be written as ce(y, p) = −log(p(y)), where y now refers to
the ground-truth class (a slight abuse of the y notation).
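A minimal sketch of the cross entropy on top of a softmax output (ours; the logit values are arbitrary examples):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(logits, y):
    # y is the index of the ground-truth class
    p = softmax(logits)
    return -np.log(p[y])

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, y=0))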



Cross Entropy (cont.)

• We can further simplify the formulation by considering only p as a
variable and the ground-truth class as a constant. This makes the
cross entropy: ce(p) = −log(p_y).
• The input to the cross entropy is K predictions and the output is a
scalar. Consequently, we need to compute the Jacobian in the
backward step.
• Jacobian: [ ∂ce/∂p_1 · · · ∂ce/∂p_K ].
• However, the cross entropy is evaluated only for the correct class y,
and thus the Jacobian is zero everywhere other than at the correct
class. It can be written as: [ 0 · · · ∂ce/∂p_y · · · 0 ].
• The combination of the cross entropy with the softmax function is
the standard approach for multi-class classification problems.
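As an aside (a sketch of a well-known identity, not derived on these slides): when softmax and cross entropy are combined, the gradient of the loss w.r.t. the logits simplifies to the predicted distribution minus the one-hot ground truth:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
p = softmax(logits)

grad_logits = p - y             # gradient of ce(softmax(logits)) w.r.t. the logits
print(grad_logits)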



L1 and L2 loss

These are the most common loss functions for regression problems. The
target y of the prediction x can vary from a set of values to matrices and
tensors.
L1-Loss
• It minimizes the absolute difference between the prediction and target.
• It can be written as L = Σ_{i=1..K} |y_i − x_i|, where K is the output dimension.

L2-Loss (squared difference)
• It minimizes the squared difference between the prediction and target.
• It can be written as L = Σ_{i=1..K} (y_i − x_i)², where K is the output dimension.

Note that the L2-loss can have a square root in addition.
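A minimal sketch in NumPy (the prediction and target values are arbitrary examples):

import numpy as np

x = np.array([0.5, 1.2, -0.3])   # prediction
y = np.array([0.0, 1.0, 0.0])    # target

l1_loss = np.sum(np.abs(y - x))
l2_loss = np.sum((y - x) ** 2)
print(l1_loss, l2_loss)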



Study Material

• Book: Deep Learning (Ian Goodfellow, Yoshua Bengio and Aaron
Courville), Chapter 6, Deep Feedforward Networks. Available online
at: https://www.deeplearningbook.org.



Next Lecture

Optimization



References I
[1] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point
sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 652–660, 2017.
[2] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop.
In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
[3] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark
Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning
Research, 18:1–43, 2018.
[4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th international conference on machine learning (ICML-10),
pages 807–814, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE
international conference on computer vision, pages 1026–1034, 2015.
[6] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep
network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–6008, 2017.

