3. NN Backprop
Lecture 3
Feed Forward Networks and Back-propagation
Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg
10.11.2023
In the last lecture, the topic was Machine Learning Basics. In detail, we
discussed:
• Machine learning definition.
• Capacity, under-fitting, over-fitting.
• Regularization.
• Maximum Likelihood Estimation.
• Gradient-based optimization.
• Classification.
• Regression.
• Note that instead of writing the variable names on the edges (data
flow), we could use nodes to represent variables too.
• We represent neural networks as computational graphs with
forward and backward passes.
• Consider the function f(x, y, z) = (x − y)z, represented by a computational graph: a subtraction node produces g = x − y, followed by a multiplication node that produces f = g · z.
• x = 5, y = 2 and z = −4 are the inputs to the graph.
• Consider the function f(x, y, z) = (x − y)z, represented by a computational graph.
• x = 5, y = 2 and z = −4 are the inputs to the graph.
• The forward pass gives g = x − y = 3 and the output f = g · z = −12 (see the sketch below).
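As a quick check, the forward pass of this graph can be evaluated node by node; a minimal sketch in plain Python (the variable names mirror the graph):

# forward pass of f(x, y, z) = (x - y) * z
x, y, z = 5.0, 2.0, -4.0
g = x - y    # subtraction node: g = 3
f = g * z    # multiplication node: f = -12
print(g, f)  # 3.0 -12.0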
Gradient computation
• For ∂f/∂x = (∂f/∂g)(∂g/∂x), there is the upstream gradient ∂f/∂g and the local gradient ∂g/∂x.
• Similarly: ∂f/∂y = (∂f/∂g)(∂g/∂y), ∂f/∂z = (∂f/∂f)(∂f/∂z), ∂f/∂g = (∂f/∂f)(∂f/∂g).
• In the example, the gradient at the input z = −4 is ∂f/∂z = g = 3 (verified in the sketch below).
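The chain-rule factors can be verified with automatic differentiation; a minimal sketch using torch autograd, with the upstream × local products written out in the comments for comparison:

import torch

x = torch.tensor(5.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

g = x - y   # local gradients: dg/dx = 1, dg/dy = -1
f = g * z   # local gradients: df/dg = z, df/dz = g
f.backward()

# upstream x local: df/dx = z * 1 = -4, df/dy = z * (-1) = 4, df/dz = g = 3
print(x.grad.item(), y.grad.item(), z.grad.item())  # -4.0 4.0 3.0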
Figure: a generic node f with inputs x, y and output z. During the backward pass, the upstream gradient ∂L/∂z reaches the node and is multiplied with the local gradients ∂z/∂x and ∂z/∂y:
∂L/∂x = (∂L/∂z)(∂z/∂x),
∂L/∂y = (∂L/∂z)(∂z/∂y).
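A minimal sketch of this upstream × local pattern for a single multiplication node z = x · y, compared against autograd (the squared loss on top of z is only an illustrative choice):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-2.0, requires_grad=True)

z = x * y           # local gradients: dz/dx = y, dz/dy = x
L = (z - 1.0) ** 2  # illustrative scalar loss on top of z
L.backward()

upstream = 2.0 * (z.item() - 1.0)          # dL/dz of the squared loss
print(x.grad.item(), upstream * y.item())  # dL/dx from autograd vs. upstream x local
print(y.grad.item(), upstream * x.item())  # dL/dy from autograd vs. upstream x local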
A few observations:
• It is effectively the chain rule.
• We back-propagate to calculate the gradients of the network
parameters.
? How do we make use of the gradients to update the network
parameters?
• The parameters are updated with gradient descent.
• It is a repetitive process that can be efficiently implemented to
avoid computational overhead [2].
? Does back-propagation apply to unsupervised learning too?
Figure: a computational graph with input x, learnable parameters w0 and w1, prediction f = x · w0 + w1, target y = 1.5, and squared-error loss (y − f)².
• After computing all gradients, the update step takes place for the learnable parameters.
? Which parameters, in the above example, should be updated?
• The optimization algorithm is (mini-batch) gradient descent.
• The parameter update is given by:
w0 = w0 − η ∂L/∂w0, (1)
w1 = w1 − η ∂L/∂w1. (2)
• The process of forward pass, backward pass and parameter update is repeated until convergence.
import torch

# data: a single input/target pair
x = torch.tensor([2.0])
y = torch.tensor([1.5])

# learnable parameters
w0 = torch.tensor([0.1], requires_grad=True)
w1 = torch.tensor([0.1], requires_grad=True)

for _ in range(100):
    # forward pass: prediction and squared-error loss
    f = (x * w0) + w1
    loss = (y - f).pow(2).sum()
    # backward pass: populates w0.grad and w1.grad
    loss.backward()

    # gradient descent update
    lr = 1e-2
    with torch.no_grad():
        w0 -= lr * w0.grad
        w1 -= lr * w1.grad

    w0.grad.zero_()
    w1.grad.zero_()
    print('Loss: {:.3} and parameters {:.3},{:.3}'.format(loss.item(), w0.item(), w1.item()))
• ∂L/∂x = (∂L/∂z)(∂z/∂x) and ∂L/∂y = (∂L/∂z)(∂z/∂y), where ∂L/∂z is the upstream gradient at the node f.
• For example, if x and z are vectors, then ∂z/∂x corresponds to the Jacobian.
• Similarly, we compute the Jacobian for ∂z/∂y (see the sketch below).
? Do we compute the Jacobian for the loss L as well?
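For small examples, the full Jacobian ∂z/∂x can be inspected directly; a minimal sketch using torch.autograd.functional.jacobian, where the vector-valued function g is only an illustrative choice:

import torch
from torch.autograd.functional import jacobian

def g(x):
    # illustrative vector-valued function z = g(x): elementwise square
    return x ** 2

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(g, x)  # Jacobian dz/dx with shape (3, 3)
print(J)            # diagonal matrix diag(2x) for this choice of g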
import torch

# learnable parameter matrix and fixed input
W = torch.tensor([[0.1, 0.2], [0.3, 0.2]], requires_grad=True)
x = torch.tensor([[2.0, 3.0], [4.0, 8.0]])
y = 10

for _ in range(1000):
    # forward pass, keeping every intermediate result
    X = []
    X.append(torch.mul(W, x))
    X.append(torch.sum(X[-1]))
    X.append(X[-1] - y)

    # store the gradients of the intermediate (non-leaf) tensors
    X[0].retain_grad()
    X[1].retain_grad()
    X[2].retain_grad()

    loss = torch.abs(X[-1])
    loss.backward()

    # gradient descent update
    lr = 1e-4
    with torch.no_grad():
        W -= lr * W.grad
        W.grad.zero_()
    print(loss.item())
• f → differentiable operation.
• x, y → inputs that correspond to scalars, vectors, matrices or tensors.
• z → output represented by a scalar, vector, matrix or tensor.
• Function: f (x) = x.
• Range: (−∞, ∞).
• Derivative: f′(x) = 1.
• The derivative is continuous
and there are no vanishing
gradients.
• It is common activation for
regression problems.
? Name applications that require
the identity module as output.
• One usually finds it as the last
layer / module of a neural
network.
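A minimal sketch of such a regression head, where the last layer is followed by the identity activation (the layer sizes and the tanh hidden activation are illustrative choices; nn.Identity only makes the identity output explicit):

import torch
import torch.nn as nn

# small regression network with an identity output activation
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.Tanh(),
    nn.Linear(8, 1),
    nn.Identity(),  # f(x) = x, so the output range stays (-inf, inf)
)

x = torch.randn(2, 4)
print(model(x))     # unbounded real-valued predictions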
• Function: f(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
• Range: (−1, 1).
• Derivative: f′(x) = 1 − f(x)².
• Derivative range: (0, 1].
• The derivative is continuous,
but there are vanishing
gradients and saturation.
• It is used as an output activation in classification problems.
? What is the main difference
from sigmoid?
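A minimal sketch that checks the derivative identity f′(x) = 1 − f(x)² against autograd:

import torch

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)
f = torch.tanh(x)

# the gradient of sum(tanh(x)) w.r.t. x is the elementwise derivative
f.sum().backward()

print(x.grad)                  # derivative from autograd
print(1.0 - f.detach() ** 2)   # closed-form derivative 1 - tanh(x)^2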
• Jacobian:
  [ ∂f_1/∂x_1 · · · ∂f_1/∂x_k ]
  [ ∂f_2/∂x_1 · · · ∂f_2/∂x_k ]
  [ · · · ]
  [ ∂f_k/∂x_1 · · · ∂f_k/∂x_k ]
• Derivative rule: ∂f_i/∂x_j = f(x, i)(1 − f(x, j)) if i = j, and ∂f_i/∂x_j = −f(x, i) f(x, j) if i ≠ j.
? This is an expensive operation. Do we need to compute the
complete Jacobian matrix?
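A minimal sketch comparing the derivative rule with an autograd Jacobian of the softmax, for a small k (the input values are arbitrary):

import torch
from torch.autograd.functional import jacobian

x = torch.tensor([1.0, 2.0, 3.0])
s = torch.softmax(x, dim=0)

# closed-form Jacobian from the rule: J_ij = s_i * (delta_ij - s_j)
J_rule = torch.diag(s) - torch.outer(s, s)

# full Jacobian via autograd
J_auto = jacobian(lambda v: torch.softmax(v, dim=0), x)

print(torch.allclose(J_rule, J_auto))  # True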
import numpy as np

# test inputs: moderate, very small and very large values
x = []
x.append([1, 2, 3])
x.append([1e-20, 2e-20, 3e-20])
x.append([1000, 2000, 3000])


def softmax(x):
    expo = np.exp(x)
    return expo / np.sum(expo)


def softmax_stable(x):
    # subtract the maximum to avoid overflow in np.exp
    c = x - np.max(x)
    expo = np.exp(c)
    return expo / np.sum(expo)


for v in x:
    print(softmax(v))
    print(softmax_stable(v))
    print('\n')
These are the most common loss functions for regression problems. The
target y of the prediction x can vary from a set of values to matrices and
tensors.
L1-Loss
• It minimizes the absolute difference between the prediction and target.
• It can be written as: L = Σ_i^K |y_i − x_i|, where K is the output dimension.
L2-Loss (Mean squared difference)
• It minimizes the squared difference between the prediction and target.
• It can be written as: L = Σ_i^K (y_i − x_i)², where K is the output dimension.
Note that the L2-loss can additionally include a square root (see the sketch below).
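A minimal sketch of both losses, written once by hand as in the sums above and once with the corresponding PyTorch modules (reduction='sum' matches the summed form):

import torch
import torch.nn as nn

x = torch.tensor([0.5, 2.0, -1.0])  # prediction
y = torch.tensor([1.5, 1.0, 0.0])   # target

# summed forms from the slide
l1 = torch.sum(torch.abs(y - x))
l2 = torch.sum((y - x) ** 2)

# equivalent PyTorch loss modules
print(l1.item(), nn.L1Loss(reduction='sum')(x, y).item())
print(l2.item(), nn.MSELoss(reduction='sum')(x, y).item())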
Optimization