
Deep Learning for Computer Vision

Lecture 3
Feed Forward Networks and Back-propagation

Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg

10.11.2023



Recapitulation

In the last lecture, the topic was machine learning basics. In detail, we
discussed:
• Machine learning definition.
• Capacity, under-fitting, over-fitting.
• Regularization.
• Maximum Likelihood Estimation.
• Gradient-based optimization.
• Classification.
• Regression.



Feedforward Neural Network
• A feedforward neural network is a composition of operations that
can be represented by nodes of a directed graph. The neural
network approximates the mapping function f, which has a
(learnable) parameter set.
• In our context, the neural network is often the mapping function
f : X → Y between the input X , e.g. image, and output Y space,
e.g. object category.
• Learning the mapping, where f is parametrized by w, requires
training data, similar to other machine learning algorithms. Given
the input x ∈ X , the network makes the prediction y = f (x; w).
Composition of functions
It is a composition of differentiable functions. For instance, a 4-layer network:
• f1 = σ(w1 x), where σ(·) is a non-linear activation.
• f2 = pool(f1 ), where pool sub-samples the feature space.
• f3 = σ(w2 f2 ), where σ(·) is a non-linear activation.
• f4 = softmax(w3 f3 ), where we make a class prediction.
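As a rough sketch of such a composition (ours, not from the slides; the input size of 8, the pooling over pairs of features and the 3 output classes are assumptions for illustration):

import torch
import torch.nn.functional as F

# hypothetical layer sizes, only for illustration
w1 = torch.randn(16, 8)   # first linear layer
w2 = torch.randn(16, 8)   # second linear layer (pooling halves the features)
w3 = torch.randn(3, 16)   # classification layer

x = torch.randn(8)

f1 = torch.sigmoid(w1 @ x)                         # f1 = sigma(w1 x)
f2 = F.max_pool1d(f1.view(1, 1, -1), 2).view(-1)   # f2 = pool(f1), sub-samples the features
f3 = torch.sigmoid(w2 @ f2)                        # f3 = sigma(w2 f2)
f4 = F.softmax(w3 @ f3, dim=0)                     # f4 = softmax(w3 f3), class prediction
print(f4)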



Feedforward Neural Network
• The operations inside a neural network can be linear or non-linear.
• It is important that they are differentiable, though.
Learning the parameters of the network
Training a neural network refers to minimizing a loss function L (gradient-based
optimization). The optimization with ∇w L involves:
• Back-propagation (a method to compute the gradients of the loss w.r.t.
each variable / parameter).
• Gradient descent (the rule for iteratively applying the gradients to update
the value of the variables / parameters).

• Note that learning the parameters can also be accomplished with
evolutionary optimization, Bayesian optimization or even random search.
• In the past, the visual representation of a neural network has mostly been
described by artificial neurons or layers.



Artificial Neuron (Composition of Operations)
• All inputs are multiplied with the neuron weights w and summed up
(similar to the convolution operation).
• The bias term b can also be included: f = σ(wx + b).
• The summation output goes through a non-linear activation
(e.g. sigmoid σ(·)).
• Hundreds or thousands of neurons communicate with each other through
their activations.
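A minimal sketch of a single neuron in PyTorch (the weight, bias and input values are arbitrary examples):

import torch

x = torch.tensor([1.0, -2.0])   # inputs x1, x2
w = torch.tensor([0.5, 0.3])    # neuron weights
b = torch.tensor(0.1)           # bias term

f = torch.sigmoid(torch.dot(w, x) + b)   # f = sigma(w x + b)
print(f)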



Multilayer Perceptron (MLP)
• The standard feedforward network is the multilayer perceptron (MLP).
This is our reference model.
• It is a popular architecture, e.g. PointNet [1].
• It is traditionally a shallow neural network (input, hidden and output layers).
(Figure: an MLP with an input layer of four inputs, one hidden layer and an
output layer with a single output.)
? Why is it challenging to add hundreds of hidden layers?
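A minimal sketch of such a shallow MLP in PyTorch (the hidden width of 8 and the sigmoid activation are assumptions for illustration):

import torch
import torch.nn as nn

# four inputs, one hidden layer, a single output
mlp = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> hidden layer
    nn.Sigmoid(),      # non-linear activation
    nn.Linear(8, 1),   # hidden layer -> output layer
)

x = torch.randn(4)
print(mlp(x))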


Neural Network as Computational Graph
Drawing neural networks as structured layers is helpful for visualization.
When it comes to computing the network parameters (also known as
weights), we build the computational graph.
• A computational graph is a directed graph.
• The nodes correspond to operations.
• The edges correspond to the data flow.
• The computational graph can represent functions such as
f(x, y, z) = (x − y)z. It is drawn with x and y feeding a subtraction node
g = x − y, whose output, together with z, feeds a multiplication node f = g × z.
• Note that instead of writing the variable names on the edges (data
flow), we could use nodes to represent variables too.
• We represent neural networks as computational graphs with
forward and backward passes.


Forward Pass

It refers to evaluating the function for a set of inputs, where the function
describes the neural network. It is the process of making a prediction with a
neural network.
(Graph: x and y feed the node g = x − y; g and z feed the node f = g × z.)



Backward Pass
Most of the graph variables are learned using gradient-based
optimization. The backward pass calculates the gradients of all
variables. It takes place after the forward pass.
(Graph: as before, x and y feed g = x − y, and g and z feed f = g × z.)

• This is the crucial step for training the neural network.


? On the above graph, consider x and y as input terms, g as an
intermediate term, z as a model term and f as the output term. Which
terms will not be updated?
• Expressing the neural network as a computational graph allows us
to model complex network compositions / functions, where the
gradient computation can be automated.



Back-propagation Example



Computational Graph

• Consider the function f(x, y, z) = (x − y)z, represented by a
computational graph with a subtraction node g and a multiplication node f.
• x = 5, y = 2 and z = −4 are the inputs to the graph.



Computational Graph (Forward Pass)

• Consider the function f(x, y, z) = (x − y)z, represented by a
computational graph.
• x = 5, y = 2 and z = −4 are the inputs to the graph.
• The forward pass yields g = x − y = 3 and f = g × z = −12.
• The output is −12.



Computational Graph (Backward Pass)
• f(x, y, z) = (x − y)z for x = 5, y = 2, z = −4.
• g = x − y = 3
• f = g × z = −12

Derivatives of the graph

• Goal? Find how the input should influence the output.
• How? Partial derivatives of the output (function) w.r.t. the input.
• We need to compute ∂f/∂x, ∂f/∂y, ∂f/∂z.
• How? The chain rule.
• ∂f/∂f = 1.
• ∂f/∂g = z, ∂f/∂z = g, where f = g × z and g = x − y = 3.
• ∂g/∂x = 1, ∂g/∂y = −1, where g = x − y.
• ∂f/∂x = (∂f/∂g)(∂g/∂x) = z · 1 = −4 · 1 = −4, ∂f/∂y = (∂f/∂g)(∂g/∂y) = 4, ∂f/∂z = 3.



Computational Graph (Backward Pass)
• Apply the chain rule from the end to the start.
• Every node has input and output variables. We need the gradient
for each of them.
• ∂f/∂x = (∂f/∂g)(∂g/∂x): during the gradient computation there is the
upstream gradient ∂f/∂g and the local gradient ∂g/∂x.
• Similarly: ∂f/∂y = (∂f/∂g)(∂g/∂y), ∂f/∂z = (∂f/∂f)(∂f/∂z), ∂f/∂g = (∂f/∂f)(∂f/∂g).
(Graph with the computed gradients: ∂f/∂x = −4, ∂f/∂y = 4, ∂f/∂z = 3, ∂f/∂g = −4.)
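The chain-rule computation above can also be sketched directly in plain Python, mirroring the numbers on the slide:

# forward pass
x, y, z = 5.0, 2.0, -4.0
g = x - y             # g = 3
f = g * z             # f = -12

# backward pass (chain rule, from the end to the start)
df_df = 1.0
df_dg = df_df * z     # upstream * local = -4
df_dz = df_df * g     # 3
df_dx = df_dg * 1.0   # dg/dx = 1  -> -4
df_dy = df_dg * -1.0  # dg/dy = -1 ->  4

print(df_dx, df_dy, df_dz)   # -4.0 4.0 3.0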



Forward and Backward Pass
These two steps are the main ingredients for training a neural network.



Computational Graph (Gradient Flow)

During the forward pass, the operation f takes as input x and y to produce z.
(Graph: x and y enter the node f, which outputs z.)



Computational Graph (Gradient Flow)
During the backward pass, the local gradients ∂z/∂x and ∂z/∂y are
computed. Note that x and y, as well as z, are related through the
operation f.
(Graph: the node f with inputs x, y and output z, annotated with the local
gradients ∂z/∂x and ∂z/∂y.)



Computational Graph (Gradient Flow)
At the beginning of the backward pass, the upstream gradient is w.r.t.
the loss L, i.e. the output of the last graph operation. For a neural network,
the loss is usually the last operation; it comes after the prediction.
(Graph: the upstream gradient ∂L/∂z reaches the node f and is combined
with the local gradients: ∂L/∂x = (∂L/∂z)(∂z/∂x), ∂L/∂y = (∂L/∂z)(∂z/∂y).)



Back-propagation

A few observations:
• It is effectively the chain rule.
• We back-propagate to calculate the gradients of the network
parameters.
? How do we make use of the gradients to update the network
parameters?
• The parameters are updated with gradient descent.
• It is a repetitive process that can be efficiently implemented to
avoid computational overhead [2].
? Does back-propagation apply to unsupervised learning too?



Back-propagation Linear Regression Example



Linear Regression Backprop

(Graph: x and w0 enter a × node; its output and w1 enter a + node; the result
enters a − node and then a (·)² node that produces the loss.)



Linear Regression Backprop
(Graph with operations: op1 = x·w0, op2 = op1 + w1, op3 = y − op2, L = op3².)

Derivatives of the graph

• Mapping: f = w0·x + w1, where the initial values of the parameters are
w0 = 0.1, w1 = 0.1 and x = 2, y = 1.5. x is the data and y the label.
• Goal: find the best parameter values and thus ∂L/∂w0 and ∂L/∂w1.
• We go from back to front to compute all derivatives. We treat each node
individually with local and upstream derivatives.



Linear Regression Backprop
(Forward pass: x = 2, w0 = 0.1, w1 = 0.1, y = 1.5; op1 = x·w0 = 0.2,
op2 = op1 + w1 = 0.3, op3 = y − op2 = 1.2, L = op3² = 1.44.)

Derivatives of the graph

• Node (·)², where the node operation is op3²: ∂L/∂op3 = 2·op3 = 2.4 (local gradient)
and ∂L/∂L = 1 (upstream). Thus ∂L/∂op3 = (∂L/∂L)(2·op3) = 2.4.
• Node −, where the node operation is op3 = y − op2: ∂op3/∂op2 = −1, ∂op3/∂y = 1
(local gradients). The upstream gradient is ∂L/∂op3 = 2.4 and thus
∂L/∂op2 = (∂L/∂op3)(∂op3/∂op2) = −2.4, ∂L/∂y = (∂L/∂op3)(∂op3/∂y) = 2.4.



Linear Regression Backprop
(Forward pass values as before: op1 = 0.2, op2 = 0.3, op3 = 1.2, L = 1.44.)

Derivatives of the graph

• Node +, where the node operation is op2 = op1 + w1: ∂op2/∂op1 = 1, ∂op2/∂w1 = 1
(local gradients); ∂L/∂op2 = −2.4 (upstream gradient). Thus
∂L/∂op1 = (∂L/∂op2)(∂op2/∂op1) = −2.4, ∂L/∂w1 = (∂L/∂op2)(∂op2/∂w1) = −2.4.
• Node ×, where the node operation is op1 = x·w0: ∂op1/∂x = w0 = 0.1,
∂op1/∂w0 = x = 2 (local gradients); ∂L/∂op1 = −2.4 (upstream gradient). Finally
∂L/∂x = (∂L/∂op1)(∂op1/∂x) = −0.24, ∂L/∂w0 = (∂L/∂op1)(∂op1/∂w0) = −4.8.



Back-propagation → Gradient Descent

• After computing all gradients, the update step takes place for the
learnable parameters.
? Which parameters, in the above example, should be updated?
• The optimization algorithm is (mini-batch) gradient descent.
• The parameter update is given by:

w0 = w0 − η ∂L/∂w0,   (1)
w1 = w1 − η ∂L/∂w1.   (2)

• The process of forward pass, backward pass and parameter update is
iterated until convergence.



Linear Regression Backprop - Code
# data and label
x = 2
y = 1.5

# initial parameter values
w0 = w1 = 0.1

for _ in range(100):
    # forward pass (operations)
    op1 = x * w0
    op2 = op1 + w1
    op3 = y - op2
    L = op3 ** 2

    # exponent node: op3 ** 2
    # local gradient(s)
    dL_dop3 = 2 * op3
    # upstream gradient
    dL_dL = 1
    # output gradient(s)
    dL_dop3 = dL_dL * dL_dop3

    # subtraction node: y - op2
    # local gradient(s)
    dop3_dop2 = -1
    dop3_dy = 1
    # output gradient(s)
    dL_dop2 = dL_dop3 * dop3_dop2
    dL_dy = dL_dop3 * dop3_dy

    # addition node: op1 + w1
    # local gradient(s)
    dop2_dop1 = 1
    dop2_dw1 = 1
    # output gradient(s)
    dL_dop1 = dL_dop2 * dop2_dop1
    dL_dw1 = dL_dop2 * dop2_dw1

    # multiplication node: x * w0
    # local gradient(s)
    dop1_dx = w0
    dop1_dw0 = x
    # output gradient(s)
    dL_dx = dL_dop1 * dop1_dx
    dL_dw0 = dL_dop1 * dop1_dw0

    # update parameters (lr: learning rate)
    lr = 0.01
    w0 = w0 - lr * dL_dw0
    w1 = w1 - lr * dL_dw1
    f = w0 * x + w1

print('Loss: {:.3} and parameters {:.3},{:.3}'.format(L, w0, w1))


Linear Regression Backprop - PyTorch Code

import torch

x = torch.tensor([2.0])
y = torch.tensor([1.5])

w0 = torch.tensor([0.1], requires_grad=True)
w1 = torch.tensor([0.1], requires_grad=True)

for _ in range(100):
    f = (x * w0) + w1
    loss = (y - f).pow(2).sum()
    loss.backward()

    lr = 1e-2
    with torch.no_grad():
        w0 -= lr * w0.grad
        w1 -= lr * w1.grad

    w0.grad.zero_()
    w1.grad.zero_()
    print('Loss: {:.3} and parameters {:.3},{:.3}'.format(loss.item(), w0.item(), w1.item()))



PyTorch Implementation

? What is missing from the PyTorch implementation, compared to the
plain Python code above?
• The gradients can be computed automatically with automatic
differentiation (autodiff / autograd) [3].
• Every computer program consists of a sequence of elementary
arithmetic operations and elementary functions.
• Repeatedly applying the chain rule yields the gradients.
• Most modern DL frameworks, such as PyTorch, work with autodiff.
• Programming new operations (layers) for neural networks involves
defining the forward and backward step. With autodiff, only the
forward step is necessary. Note that without autodiff, it would be
important to compute numerical gradients to check the backward
step's gradients, as sketched below.
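As a sketch of such a numerical check (ours, using central differences on the linear-regression loss from the previous slides):

import torch

x = torch.tensor([2.0])
y = torch.tensor([1.5])
w0 = torch.tensor([0.1], requires_grad=True)
w1 = torch.tensor([0.1], requires_grad=True)

loss = (y - (x * w0 + w1)).pow(2).sum()
loss.backward()   # autodiff gradient

def loss_at(a, b):
    # loss evaluated at given parameter values
    return float((y - (x * a + b)).pow(2).sum())

eps = 1e-4   # central-difference approximation of dL/dw0
num_grad_w0 = (loss_at(w0.item() + eps, w1.item()) -
               loss_at(w0.item() - eps, w1.item())) / (2 * eps)
print(w0.grad.item(), num_grad_w0)   # the two values should closely agree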



Linear Regression Backprop - Observations

• A simple case of a neural network.
• Deep neural networks use the same training principles at large scale.
• It is important to efficiently program the forward and backward
operations to generalize to millions of parameters and training samples.
• In the general case, the input has vector form.
• The Jacobian matrix is then computed for all variables of an operation.
• The gradient descent algorithm is applied in the same way to vectors.



Jacobian Matrix

For example, consider the function f : R^4 → R^3 with input x and output
z. The derivative of each output element w.r.t. each input element is the
Jacobian, given by:

J = [ ∂z1/∂x1  ∂z1/∂x2  ∂z1/∂x3  ∂z1/∂x4
      ∂z2/∂x1  ∂z2/∂x2  ∂z2/∂x3  ∂z2/∂x4
      ∂z3/∂x1  ∂z3/∂x2  ∂z3/∂x3  ∂z3/∂x4 ]   (3)
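As a sketch, such a Jacobian can be obtained automatically in PyTorch; the mapping below (a random linear map followed by tanh) is only an assumed example of an f : R^4 → R^3:

import torch
from torch.autograd.functional import jacobian

A = torch.randn(3, 4)   # assumed parameters of the example mapping

def f(x):
    return torch.tanh(A @ x)   # f: R^4 -> R^3

x = torch.randn(4)
J = jacobian(f, x)
print(J.shape)   # torch.Size([3, 4]), matching Eq. (3)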



Back-propagation and Vectorization



Computational Graph (Vectorized Operations)
The main difference with vector inputs / outputs is in the backward
step. The Jacobian matrix now has to be computed for one or more
pairs of input and output.

(Graph: node f with inputs x, y and output z; the gradients combine as
∂L/∂x = (∂L/∂z)(∂z/∂x) and ∂L/∂y = (∂L/∂z)(∂z/∂y), with upstream gradient ∂L/∂z.)
• For example, if x and z are vectors, then ∂z/∂x corresponds to the Jacobian.
• Similarly, we compute the Jacobian for ∂z/∂y.
? Do we compute the Jacobian for the loss L as well?



Back-propagation, Vectorized Example



Vectorized Example

Consider the function f(x, W) = W × x, where x ∈ R^{n×n}, W ∈ R^{n×n} and
× denotes the Hadamard (elementwise) product. Represented as a computational
graph, the operation is followed by the loss function
L = | Σ_{i=1}^{n×n} f(x, W)_i − y |.

(Graph: W and x enter a × node producing X0; a Σ node produces X1; a − node
with y produces X2; a |·| node produces the loss L.)



Vectorized Example (Forward, Backward and Parameter Update)
The Jacobians or derivatives are computed during back-propagation.

(Forward pass: W = [[0.1, 0.2], [0.3, 0.2]], x = [[2.0, 3.0], [4.0, 8.0]], y = 10;
X0 = W × x = [[0.2, 0.6], [1.2, 1.6]], X1 = Σ X0 = 3.6, X2 = X1 − y = −6.4, L = |X2| = 6.4.)

Back-propagation

• Node |·|, where the node operation is L = |X2|: ∂L/∂X2 = X2/|X2| = −1 (local
gradient) and ∂L/∂L = 1 (upstream gradient).
• Node −, where the node operation is X2 = X1 − y: ∂X2/∂X1 = 1 (local gradient)
and ∂L/∂X2 = −1 (upstream gradient). Thus ∂L/∂X1 = (∂L/∂X2)(∂X2/∂X1) = −1.
• Node Σ, where the operation is X1 = X0(0,0) + X0(0,1) + X0(1,0) + X0(1,1): the
Jacobian of the local gradients is
∂X1/∂X0 = [[∂X1/∂X0(0,0), ∂X1/∂X0(0,1)], [∂X1/∂X0(1,0), ∂X1/∂X0(1,1)]] = [[1, 1], [1, 1]],
and ∂L/∂X1 = −1 (upstream). Thus ∂L/∂X0 = (∂L/∂X1)(∂X1/∂X0) = [[−1, −1], [−1, −1]].



Vectorized Example (Forward, Backward and Parameter Update)
We skip the gradient of x and focus only on the gradients of W, because these
are the parameters to be learned.

(Forward pass as before: W = [[0.1, 0.2], [0.3, 0.2]], x = [[2.0, 3.0], [4.0, 8.0]],
y = 10; X0 = [[0.2, 0.6], [1.2, 1.6]], X1 = 3.6, X2 = −6.4, L = 6.4.)

Back-propagation

• Node ×, where the operation is X0 = W × x (elementwise multiplication): for the
local gradient ∂X0/∂W, the Jacobian matrix is
[[∂X0/∂W(0,0), ∂X0/∂W(0,1)], [∂X0/∂W(1,0), ∂X0/∂W(1,1)]] = [[2, 3], [4, 8]]. The
upstream gradient ∂L/∂X0 is known, and thus ∂L/∂W = (∂L/∂X0)(∂X0/∂W) =
[[−2, −3], [−4, −8]]. Note the elementwise multiplication.
• Finally, we update the parameters with gradient descent: W = W − η ∂L/∂W.
• The forward step, backward step and parameter update are applied iteratively
until convergence.



Vectorized Examples (PyTorch)

import torch

W = torch.tensor([[0.1, 0.2], [0.3, 0.2]], requires_grad=True)
x = torch.tensor([[2.0, 3.0], [4.0, 8.0]])
y = 10

for _ in range(1000):
    X = []
    X.append(torch.mul(W, x))
    X.append(torch.sum(X[-1]))
    X.append(X[-1] - y)

    # store the gradients of the intermediate results
    X[0].retain_grad()
    X[1].retain_grad()
    X[2].retain_grad()

    loss = torch.abs(X[-1])
    loss.backward()

    lr = 1e-4
    with torch.no_grad():
        W -= lr * W.grad
        W.grad.zero_()
    print(loss.item())



Vectorized Example (Observations)

• Back-propagation and gradient descent work in the same way with
vectors and scalars.
• When working with vectors, the gradients of a variable should
have the same dimensions as the variable, as the check below illustrates.
• The operations can also be described as layers or modules.
• It is a requirement for a module / layer to be differentiable.
• A few popular modules are linear, ReLU, sigmoid, tanh, softmax,
the L2-loss and the cross-entropy loss.
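A quick sketch of this property (the tensor shape is an arbitrary example):

import torch

W = torch.randn(3, 4, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()
print(W.grad.shape == W.shape)   # True: the gradient matches the variable's dimensions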



Neural Network Modules



Neural Network Modules

(Graph: inputs x and y enter the node f, which outputs z.)

• f → a differentiable operation.
• x, y → inputs that correspond to scalars, vectors, matrices or tensors.
• z → output represented by a scalar, vector, matrix or tensor.



Identity

• Function: f (x) = x.
• Range: (−∞, ∞).
• Derivative: f 0 (x) = 1.
• The derivative is continuous
and there are no vanishing
gradients.
• It is a common activation for
regression problems.
? Name applications that require
the identity module as output.
• One usually finds it as the last
layer / module of a neural
network.



Rectified Linear Unit (ReLU) [4]
• Function: f(x) = x if x ≥ 0, 0 if x < 0.
• Range: [0, ∞).
• Derivative: f′(x) = 1 if x ≥ 0, 0 if x < 0.
• Derivative range: {0, 1}.
• It is the most common activation function in the deep learning era.
• The derivative is not continuous at 0.
• It causes problems for non-positive inputs.
? What is the motivation behind ReLU?


Leaky Rectified Linear Unit (ReLU) [5]
• Function: f(x) = x if x ≥ 0, 0.5x if x < 0.
• Range: (−∞, ∞).
• Derivative: f′(x) = 1 if x ≥ 0, 0.5 if x < 0.
• Derivative range: {0.5, 1}.
• The main difference from ReLU is that it allows negative inputs to pass
with a small slope.
• The derivative is not continuous at 0.
• The common coefficient is 0.01 (0.5 is used here only for illustration).
? Is Leaky ReLU a good activation for regression?



Exponential Linear Unit (ELU) [6]
• Function: f(x) = x if x ≥ 0, α(e^x − 1) if x < 0.
• Range: [−α, ∞) for α ≥ 0 and [0, ∞) for α < 0.
• Derivative: f′(x) = 1 if x ≥ 0, αe^x if x < 0.
• Derivative range: [0, α) ∪ {1}.
• It usually converges faster than ReLU and delivers better performance.
• It is smoother than ReLU.
? Is ELU a good activation for regression?



Logistic Sigmoid
• Function: f(x) = 1 / (1 + e^(−x)).
• Range: [0, 1].
• Derivative: f′(x) = e^(−x) / (1 + e^(−x))².
• Derivative range: (0, 0.25].
• The derivative is continuous, but there are vanishing gradients and
saturation.
• It is known from logistic regression.
• It is used as an output activation for classification problems.
? What is the advantage of bounding the range of the function?



Hyperbolic Tangent (tanh)

• Function: f(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
• Range: (−1, 1).
• Derivative: f′(x) = 1 − f(x)².
• Derivative range: (0, 1].
• The derivative is continuous, but there are vanishing gradients and
saturation.
• It is used as an output activation for classification problems.
? What is the main difference from the sigmoid?
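As a compact sketch (ours, in NumPy, using the coefficients stated on these slides), the activations above and their derivatives could be implemented as follows:

import numpy as np

def relu(x):
    return np.where(x >= 0, x, 0.0)

def relu_grad(x):
    return np.where(x >= 0, 1.0, 0.0)

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid_grad(x), tanh_grad(x))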



Softmax
• The standard activation for multi-class problems. For a K-class
prediction, there are K activations (possible outputs).
• Function: f(x, i) = e^(x_i) / Σ_{j=1..K} e^(x_j). Note that Σ_{j=1..K} f(x, j) = 1.
• Range: [0, 1].
• The input and output of the softmax are represented by vectors (of
K elements). As a result, we need to compute the Jacobian during
the backward step.
• Jacobian:
  [ ∂f_1/∂x_1 · · · ∂f_1/∂x_K
    ∂f_2/∂x_1 · · · ∂f_2/∂x_K
    · · ·
    ∂f_K/∂x_1 · · · ∂f_K/∂x_K ]
• Derivative rule: ∂f_i/∂x_j = f(x, i)(1 − f(x, j)) if i = j, and −f(x, i)f(x, j) if i ≠ j.
? This is an expensive operation. Do we need to compute the
complete Jacobian matrix?
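A small NumPy sketch of the softmax and its full Jacobian (ours; the max-shift anticipates the numerical-stability trick discussed next):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / np.sum(e)

def softmax_jacobian(x):
    s = softmax(x)
    # J[i, j] = s_i * (delta_ij - s_j), i.e. the derivative rule above
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))
print(softmax_jacobian(x))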



Softmax (Numerical Stability)
• To avoid having very small and very large numbers as arguments to the
exponential, we can normalize the inputs.
• Common errors would be RuntimeWarning: overflow encountered in exp
(in expo = np.exp(x)) or RuntimeWarning: invalid value encountered in
true_divide (in return expo / np.sum(expo)).
• We introduce a constant C in the softmax:
f(x, i) = e^(x_i) / Σ_j e^(x_j) = C·e^(x_i) / Σ_j C·e^(x_j).
• Next we bring the constant into the exponent:
C·e^(x_i) / Σ_j C·e^(x_j) = e^(ln C)·e^(x_i) / Σ_j e^(ln C)·e^(x_j)
= e^(x_i + ln C) / Σ_j e^(x_j + ln C).
• We can replace ln C with D and treat it as an arbitrary constant:
e^(x_i + ln C) / Σ_j e^(x_j + ln C) = e^(x_i + D) / Σ_j e^(x_j + D).
• Finally, we choose D = −max(x), which shifts the inputs closer to zero.
All values other than the maximum become negative. As a result,
very large exponents are shifted towards zero.
? What could happen if we don't shift the values to be negative?



Softmax (Code)

import numpy as np

x = []
x.append([1, 2, 3])
x.append([1e-20, 2e-20, 3e-20])
x.append([1000, 2000, 3000])


def softmax(x):
    expo = np.exp(x)
    return expo / np.sum(expo)


def softmax_stable(x):
    c = x - np.max(x)
    expo = np.exp(c)
    return expo / np.sum(expo)


for v in x:
    print(softmax(v))
    print(softmax_stable(v))
    print('\n')



Softmax (Observations)

• We find the softmax mostly as the last module of a network. It has
recently been used as an attention mechanism as well [7].
• The input to the softmax is called the logits and expresses
unnormalized probabilities.
• The output of the softmax is a probability distribution.



Cross Entropy
• Cross entropy measures the distance between two probability
distributions (in our context).
• One probability distribution corresponds to the softmax output of
the neural network and the other to the ground truth.
• The ground truth is the one-hot vector, defined as a K-element
vector: all of its elements are zero other than the correct class,
which is set to one.
• The cross entropy is given by ce(p, q) = −Σ_k p(k) log(q(k)),
where p and q are the compared probability distributions.
• In our problem, we have K classes. We define the ground-truth
distribution as y and the softmax output as p. The cross entropy is
re-written as ce(y, p) = −Σ_{k=1..K} y(k) log(p(k)), where p(k) is the
probability predicted by the model and y(k) the ground-truth
probability.
• y(k) = 1 only for the ground-truth class, and the cross entropy can
then be written as ce(y, p) = −log(p(y)), where y now refers to
the ground-truth class (a slight abuse of the y notation).
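A minimal sketch of the cross entropy on top of a softmax output (ours; the logit values are arbitrary examples):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(logits, y):
    # y is the index of the ground-truth class
    p = softmax(logits)
    return -np.log(p[y])

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, y=0))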



Cross Entropy (cont.)

• We can further simplify the formulation by considering only p as a
variable and the ground-truth class as a constant. This makes the
cross entropy: ce(p) = −log(p_y).
• The input to the cross entropy is K predictions and the output is a
scalar. Consequently, we need to compute the Jacobian in the
backward step.
• Jacobian: [ ∂ce/∂p_1 · · · ∂ce/∂p_K ].
• However, the cross entropy is evaluated only for the correct class y,
and thus the Jacobian is zero everywhere other than at the correct
class. It can be written as: [ 0 · · · ∂ce/∂p_y · · · 0 ].
• The combination of the cross entropy with the softmax function is
the standard approach for multi-class classification problems.
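As an aside (a sketch of a well-known identity, not derived on these slides): when softmax and cross entropy are combined, the gradient of the loss w.r.t. the logits simplifies to the predicted distribution minus the one-hot ground truth:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
p = softmax(logits)

grad_logits = p - y             # gradient of ce(softmax(logits)) w.r.t. the logits
print(grad_logits)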



L1 and L2 loss

These are the most common loss functions for regression problems. The
target y of the prediction x can vary from a set of values to matrices and
tensors.
L1-Loss
• It minimizes the absolute difference between the prediction and target.
• It can be written as L = Σ_{i=1..K} |y_i − x_i|, where K is the output dimension.

L2-Loss (squared difference)
• It minimizes the squared difference between the prediction and target.
• It can be written as L = Σ_{i=1..K} (y_i − x_i)², where K is the output dimension.

Note that the L2-loss can have a square root in addition.
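A minimal sketch in NumPy (the prediction and target values are arbitrary examples):

import numpy as np

x = np.array([0.5, 1.2, -0.3])   # prediction
y = np.array([0.0, 1.0, 0.0])    # target

l1_loss = np.sum(np.abs(y - x))
l2_loss = np.sum((y - x) ** 2)
print(l1_loss, l2_loss)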



Study Material

• Book: Deep Learning (Ian Goodfellow, Yoshua Bengio and Aaron
Courville), Chapter 6, Deep Feedforward Networks. Available online
at: https://www.deeplearningbook.org.



Next Lecture

Optimization



References I
[1] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point
sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 652–660, 2017.
[2] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop.
In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
[3] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark
Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning
Research, 18:1–43, 2018.
[4] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th international conference on machine learning (ICML-10),
pages 807–814, 2010.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE
international conference on computer vision, pages 1026–1034, 2015.
[6] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep
network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–6008, 2017.

