
WiSe 2023/24

Deep Learning 1

Lecture 2: Error Backpropagation


Outline

▶ The Perceptron algorithm

▶ Perceptron as gradient descent

▶ How to compute the gradient in a neural network

▶ Numerical gradient computation

▶ The error backpropagation algorithm


▶ Chain rule and multivariate chain rule
▶ Worked-through example
▶ General equation for backpropagation
▶ Vectorization of backpropagation
▶ Automatic differentiation

Part 1: The Perceptron
The Perceptron

▶ Algorithm proposed in 1958 by F. Rosenblatt (1928-1971) to train single-layer neural networks.

▶ The algorithm produces classifiers that perfectly separate the training data (if the data is linearly separable).

▶ The algorithm consists of a simple and cheap iterative procedure.
The Perceptron

Structure of the perceptron:

▶ A weighted sum of the input features x1, ..., xd with weights w1, ..., wd:

$$z = \sum_{i=1}^{d} w_i x_i + b = w^\top x + b$$

followed by the sign function

$$y = \mathrm{sign}(z)$$

Problem formulation: Let

$$x^{(1)}, x^{(2)}, \dots, x^{(N)} \in \mathbb{R}^d$$

be our input data and $t^{(1)}, t^{(2)}, \dots, t^{(N)} \in \{-1, 1\}$ our corresponding labels (aka targets). The goal of the perceptron is to learn a collection of parameters $(w, b)$ such that all points are correctly classified, i.e.

$$\forall_{k=1}^{N}: \; y^{(k)} = t^{(k)}$$
The Perceptron Algorithm

Recall that for each data point, the predictions of our perceptron are computed as:

$$z^{(k)} = w^\top x^{(k)} + b, \qquad y^{(k)} = \mathrm{sign}(z^{(k)})$$

Perceptron algorithm
▶ Iterate (multiple times) over the data points k = 1 ... N.
▶ If $x^{(k)}$ is correctly classified ($y^{(k)} = t^{(k)}$), continue.
▶ If $x^{(k)}$ is wrongly classified ($y^{(k)} \neq t^{(k)}$), update the perceptron:

$$w \leftarrow w + \eta \cdot x^{(k)} t^{(k)}, \qquad b \leftarrow b + \eta \cdot t^{(k)}$$

where $\eta$ is a learning rate.

▶ Stop once all examples are correctly classified.
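As a concrete illustration, here is a minimal NumPy sketch of this procedure (function and variable names are our own, not from the lecture):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron algorithm: X is (N, d), t is (N,) with entries in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for k in range(N):
            y = np.sign(w @ X[k] + b)       # prediction for data point k
            if y != t[k]:                   # misclassified: apply the update
                w += eta * X[k] * t[k]
                b += eta * t[k]
                errors += 1
        if errors == 0:                     # all examples correctly classified
            break
    return w, b
```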

Perceptron at Work

[Figure: decision boundary of the perceptron on a 2D toy dataset at iterations 0, 1, 3, 7, 15 and 31.]
The Perceptron: Optimization View

Proposition
The perceptron can be seen as a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w,b)}$$

Also known as the Hinge Loss.

Proof.

$$w - \eta \frac{\partial E_k}{\partial w} = w - \eta \cdot 1_{-z^{(k)} t^{(k)} > 0} \cdot \left(-t^{(k)}\right) \frac{\partial z^{(k)}}{\partial w} = w - \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot \left(-t^{(k)}\right) \frac{\partial z^{(k)}}{\partial w} = w + \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot x^{(k)} t^{(k)}$$

which is the parameter update equation of the perceptron algorithm. We proceed similarly for the parameter b.
Perceptron vs. Nearest Mean Classifier

The nearest mean classifier first builds the means of the two classes:

$$\mu_1 = \frac{1}{|C_1|} \sum_{k \in C_1} x^{(k)} \qquad \text{and} \qquad \mu_2 = \frac{1}{|C_2|} \sum_{k \in C_2} x^{(k)}$$

Then it predicts:

$$\text{class 1 if } \|x - \mu_1\| < \|x - \mu_2\|, \qquad \text{class 2 if } \|x - \mu_1\| > \|x - \mu_2\|$$
Perceptron vs. Nearest Mean Classifier

Equivalent formulation of the nearest mean classifier:

$$\text{class 1 if } \|x - \mu_1\|^2 < \|x - \mu_2\|^2, \qquad \text{class 2 if } \|x - \mu_1\|^2 > \|x - \mu_2\|^2$$

(raising distances to the square does not change the decision).

The nearest mean classifier can then be further developed to become expressible as a linear classifier:

$$
\begin{aligned}
y &= \mathrm{sign}(\|x - \mu_2\|^2 - \|x - \mu_1\|^2) \\
  &= \mathrm{sign}(\|x\|^2 - 2\mu_2^\top x + \|\mu_2\|^2 - \|x\|^2 + 2\mu_1^\top x - \|\mu_1\|^2) \\
  &= \mathrm{sign}\big(\underbrace{2(\mu_1 - \mu_2)}_{w}{}^{\!\top} x + \underbrace{\|\mu_2\|^2 - \|\mu_1\|^2}_{b}\big)
\end{aligned}
$$

where y = 1 corresponds to class 1 and y = −1 corresponds to class 2.
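A minimal NumPy sketch of this derivation (the synthetic data and names are illustrative): it builds the class means, forms w and b as above, and verifies that the linear form agrees with the squared-distance comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))    # class 1 samples
X2 = rng.normal(loc=[+2.0, 0.0], size=(50, 2))    # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)       # class means

# Linear-classifier form derived above
w = 2 * (mu1 - mu2)
b = np.dot(mu2, mu2) - np.dot(mu1, mu1)

# The linear form agrees with the squared-distance comparison
x = rng.normal(size=2)                            # arbitrary test point
lhs = w @ x + b
rhs = np.linalg.norm(x - mu2) ** 2 - np.linalg.norm(x - mu1) ** 2
assert np.isclose(lhs, rhs)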
Perceptron vs. Nearest Mean Classifier

Question:
▶ Both the perceptron and the nearest mean classifier are linear classifiers. How, then, do the perceptron and the nearest mean classifier differ?

Observation
▶ The perceptron always separates the data when it is linearly separable. This is not always the case for the nearest mean classifier, especially when the data of each class is elongated.

The nearest mean classifier describes the data; the perceptron minimizes the error.
Part 2: Training Multilayer Networks
From Perceptrons to Training Deep Networks
Recap: The perceptron performs a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w,b)}$$

Idea: Generalize the formulation so that z is not the output of the perceptron, but of any multilayer neural network.

Question: How do we compute ∂E/∂θ, i.e. the gradient of the newly defined error function w.r.t. the model parameters?
Naive Approach: Numerical Differentiation

Formula for numerical differentiation:

$$\forall t: \quad \frac{\partial E}{\partial \theta_t} = \lim_{\epsilon \to 0} \frac{E(\theta + \epsilon \cdot \delta_t) - E(\theta)}{\epsilon}$$

where δt is an indicator vector for the parameter t.

Properties:
▶ Can be applied to any error function E (not necessarily the error of a neural network).

▶ Needs to evaluate the function as many times as there are parameters (→ slow when the number of parameters is large).

▶ A neural network typically has between 10³ and 10⁹ parameters (→ numerical differentiation is infeasible for training).

▶ Still useful as a unit test for verifying gradient computations, as sketched below.

▶ Because ϵ and the numerator are very small, numerical differentiation only works with high-precision arithmetic (e.g. float64 rather than float32).
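A minimal sketch of such a gradient-check routine (names are illustrative; a small but finite ϵ replaces the limit):

```python
import numpy as np

def numerical_gradient(E, theta, eps=1e-6):
    """Approximate dE/dtheta by finite differences, one parameter at a time."""
    theta = theta.astype(np.float64)        # high precision is essential here
    grad = np.zeros_like(theta)
    for t in range(theta.size):
        delta = np.zeros_like(theta)
        delta[t] = 1.0                      # indicator vector for parameter t
        grad[t] = (E(theta + eps * delta) - E(theta)) / eps
    return grad

# Usage: check against a known gradient on a toy quadratic error
E = lambda th: 0.5 * np.sum(th ** 2)        # E(theta) = 0.5 * ||theta||^2
theta = np.array([1.0, -2.0, 3.0])
assert np.allclose(numerical_gradient(E, theta), theta, atol=1e-4)  # dE/dtheta = theta
```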
Better Approach: The Chain Rule

Suppose that some parameter of interest θq (one element of the parameter vector θ) is linked to the output of the network through some sequence of functions:

[Diagram: θq → a (layer 1) → b (layer 2) → z]

The chain rule for derivatives states that

$$\frac{\partial z}{\partial \theta_q} = \frac{\partial z}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial \theta_q}$$

i.e. the derivative w.r.t. the parameter of interest is the product of the local derivatives along the path connecting θq to z.
The Multivariate Chain Rule

In practice, a parameter of interest may be linked to the output of the network through multiple paths (formed by all neurons between them):

[Diagram: θq → a1, a2, ... (layer 1) → b1, b2, ... (layer 2) → z]

The chain rule can be extended to this multivariate scenario by enumerating all the paths between θq and z:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \sum_j \frac{\partial z}{\partial b_j} \frac{\partial b_j}{\partial a_i} \frac{\partial a_i}{\partial \theta_q}$$

where the sums over i and j run over the indices of the neurons in the corresponding layers. The complexity of this naive enumeration grows exponentially with the number of layers.
Factor Structure in the Multivariate Chain Rule

[Diagram: the same two-layer network, θq → (a1, a2, ...) → (b1, b2, ...) → z.]

▶ The computation can be rewritten so that the summing operations are performed incrementally:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \frac{\partial a_i}{\partial \theta_q} \underbrace{\sum_j \frac{\partial b_j}{\partial a_i} \underbrace{\frac{\partial z}{\partial b_j}}_{\delta_j}}_{\delta_i}$$

▶ Intermediate computations can be reused for different paths, and for different parameters for which we would like to compute the gradient.

▶ Overall, the resulting gradient computation (w.r.t. all parameters in the network) becomes linear in the size of the network (⇒ fast!).

▶ The algorithm is known as the Error Backpropagation algorithm (Rumelhart, 1986).
Part 3: Worked-Through Example and Formalization
Worked-Through Example

Consider a small network with inputs a1, a2, hidden neurons a3, a4 (first layer) and a5, a6 (second layer), output z7, and weights w13, w23, w24, w35, w36, w46, w57, w67 connecting them.

Forward pass:

$$
\begin{aligned}
a_1 &= x_1 \\
a_2 &= x_2 \\
z_3 &= a_1 w_{13} + a_2 w_{23} & a_3 &= g(z_3) \\
z_4 &= a_2 w_{24} & a_4 &= g(z_4) \\
z_5 &= a_3 w_{35} & a_5 &= g(z_5) \\
z_6 &= a_3 w_{36} + a_4 w_{46} & a_6 &= g(z_6) \\
z_7 &= a_5 w_{57} + a_6 w_{67} \\
E &= \max(0, -z_7 t)
\end{aligned}
$$

Backward pass: starting at the output, we first compute the error signal

$$\delta_7 = \frac{\partial E}{\partial z_7} = 1_{-z_7 t > 0} \cdot (-t)$$

from which the gradients of the last layer's weights follow:

$$\frac{\partial E}{\partial w_{67}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{67}} = \delta_7 \cdot a_6
\qquad
\frac{\partial E}{\partial w_{57}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{57}} = \delta_7 \cdot a_5$$

We then propagate the error signal to the second hidden layer:

$$\delta_6 = \frac{\partial E}{\partial a_6} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_6} = \delta_7 \cdot w_{67}
\qquad
\delta_5 = \frac{\partial E}{\partial a_5} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_5} = \delta_7 \cdot w_{57}$$

and extract the gradients of that layer's weights:

$$\frac{\partial E}{\partial w_{46}} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial w_{46}} = \delta_6 \cdot g'(z_6) \cdot a_4
\qquad
\frac{\partial E}{\partial w_{36}} = \delta_6 \cdot g'(z_6) \cdot a_3
\qquad
\frac{\partial E}{\partial w_{35}} = \delta_5 \cdot g'(z_5) \cdot a_3$$

Propagating one layer further (note that a3 contributes to the output through two paths, via z5 and via z6):

$$\delta_4 = \frac{\partial E}{\partial a_4} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_4} = \delta_6 \cdot g'(z_6) \cdot w_{46}$$

$$\delta_3 = \frac{\partial E}{\partial a_3} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_3} + \frac{\partial E}{\partial a_5} \frac{\partial a_5}{\partial z_5} \frac{\partial z_5}{\partial a_3} = \delta_6 \cdot g'(z_6) \cdot w_{36} + \delta_5 \cdot g'(z_5) \cdot w_{35}$$

and finally we obtain the gradients of the first layer's weights:

$$\frac{\partial E}{\partial w_{24}} = \delta_4 \cdot g'(z_4) \cdot a_2
\qquad
\frac{\partial E}{\partial w_{23}} = \delta_3 \cdot g'(z_3) \cdot a_2
\qquad
\frac{\partial E}{\partial w_{13}} = \delta_3 \cdot g'(z_3) \cdot a_1$$
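To make this concrete, here is a minimal NumPy sketch of the same forward and backward pass. The lecture leaves g generic; tanh is our own illustrative choice, and the result is checked against the numerical-differentiation formula from the previous part.

```python
import numpy as np

def forward_backward(w, x, t):
    """Forward and backward pass of the worked-through example; w is a dict of weights."""
    # Forward pass
    a1, a2 = x
    z3 = a1 * w['w13'] + a2 * w['w23']; a3 = np.tanh(z3)
    z4 = a2 * w['w24'];                 a4 = np.tanh(z4)
    z5 = a3 * w['w35'];                 a5 = np.tanh(z5)
    z6 = a3 * w['w36'] + a4 * w['w46']; a6 = np.tanh(z6)
    z7 = a5 * w['w57'] + a6 * w['w67']
    E = max(0.0, -z7 * t)

    # Backward pass; for g = tanh, g'(z) = 1 - tanh(z)^2
    gp = lambda z: 1.0 - np.tanh(z) ** 2
    d7 = -t if -z7 * t > 0 else 0.0
    d6 = d7 * w['w67']
    d5 = d7 * w['w57']
    d4 = d6 * gp(z6) * w['w46']
    d3 = d6 * gp(z6) * w['w36'] + d5 * gp(z5) * w['w35']
    grad = {'w67': d7 * a6, 'w57': d7 * a5,
            'w46': d6 * gp(z6) * a4, 'w36': d6 * gp(z6) * a3,
            'w35': d5 * gp(z5) * a3,
            'w24': d4 * gp(z4) * a2, 'w23': d3 * gp(z3) * a2,
            'w13': d3 * gp(z3) * a1}
    return E, grad

# Unit test: compare one backpropagated gradient with a finite difference
names = ['w13', 'w23', 'w24', 'w35', 'w36', 'w46', 'w57', 'w67']
w = dict(zip(names, np.random.default_rng(0).normal(size=8)))
x, t, eps = (1.0, -2.0), 1.0, 1e-6
E, grad = forward_backward(w, x, t)
w2 = dict(w); w2['w13'] += eps
E2, _ = forward_backward(w2, x, t)
assert abs((E2 - E) / eps - grad['w13']) < 1e-4
```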
Formalization for a Standard Neural Network

The error gradient can be propagated from layer to layer using the chain rule:

$$\delta_j = \frac{\partial E}{\partial a_j} = \sum_k \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial a_j}}_{w_{jk}}$$

And the gradients w.r.t. the parameters at each layer can be extracted as:

$$\frac{\partial E}{\partial w_{jk}} = \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial w_{jk}}}_{a_j}$$
Matrix Formulation

Observation:
▶ The backpropagation equations can be written as matrix-vector products, or outer products:

Neuron-wise:

$$\delta_j = \sum_k \delta_k \, g'(z_k) \, w_{jk} \qquad\qquad \frac{\partial E}{\partial w_{jk}} = \delta_k \, g'(z_k) \, a_j$$

Layer-wise:

$$\delta^{(l-1)} = W^{(l-1,l)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big) \qquad\qquad \frac{\partial E}{\partial W^{(l-1,l)}} = a^{(l-1)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big)^\top$$

where:
▶ j and k are indices for neurons at layers l − 1 and l respectively,
▶ ⊙ is an element-wise multiplication,
▶ g′(·) is the derivative of g applied element-wise,
▶ W^(l−1,l) is a matrix of size (d_(l−1) × d_l), where d_(l−1) and d_l indicate the number of neurons at layers l − 1 and l respectively.

Note:
▶ Further vectorization can be achieved by computing the gradient for multiple data points at once, in which case we have matrix-matrix products. A sketch of the layer-wise equations is given below.
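A minimal NumPy sketch of the layer-wise equations (layer sizes, names and the tanh nonlinearity are illustrative choices):

```python
import numpy as np

# Illustrative layer sizes: d_{l-1} = 4 neurons, d_l = 3 neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # W^(l-1,l), shape (d_{l-1}, d_l)
a_prev = rng.normal(size=4)        # a^(l-1), activations of layer l-1
z = a_prev @ W                     # z^(l), pre-activations of layer l
delta = rng.normal(size=3)         # delta^(l), error signal at layer l

g_prime = lambda z: 1.0 - np.tanh(z) ** 2      # example: g = tanh

# Layer-wise backpropagation equations
delta_prev = W @ (g_prime(z) * delta)          # delta^(l-1), shape (4,)
dE_dW = np.outer(a_prev, g_prime(z) * delta)   # dE/dW^(l-1,l), shape (4, 3)
```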
Part 4: Further Remarks / Advanced Topics
Choice of Nonlinear Activation Function

In practice, for training to proceed, the nonlinear activation function must be chosen such that:

1. Its gradient is defined (almost) everywhere.

2. There is a significant portion of the input domain where the gradient is non-zero.

3. The gradient is informative, i.e. it indicates whether the activation function decreases or increases.

Common activation functions:
▶ g(z) = exp(z)/(1 + exp(z))
▶ g(z) = tanh(z)
▶ g(z) = max(0, z)

Problematic activation functions:
▶ g(z) = max(0, z − 100)  (gradient is zero on most of the input domain)
▶ g(z) = 1_{z>0}  (gradient is zero wherever it is defined)
▶ g(z) = sin(100 · z)  (gradient is non-zero but uninformative)
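A small sketch of criterion 2 for two of these functions (the sampling range is an arbitrary assumption):

```python
import numpy as np

z = np.linspace(-5, 5, 10001)
relu_grad = (z > 0).astype(float)        # gradient of g(z) = max(0, z)
shifted_grad = (z > 100).astype(float)   # gradient of g(z) = max(0, z - 100)

print((relu_grad != 0).mean())      # ~0.5: non-zero on half the sampled domain
print((shifted_grad != 0).mean())   # 0.0: zero everywhere sampled
```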
Neural Networks with Shared Parameters

Shared parameters can be handled by treating the original parameters as neurons generated from the new parameters, and applying the chain rule one step further:

[Diagram: the network of the worked-through example, where the weights w13, w24, w36, w57 are all generated from a shared parameter u, and the weights w23, w35, w46, w67 from a shared parameter v.]

Chain rule equations:

$$\frac{\partial z_7}{\partial u} = \frac{\partial z_7}{\partial w_{13}} \underbrace{\frac{\partial w_{13}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{24}} \underbrace{\frac{\partial w_{24}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{36}} \underbrace{\frac{\partial w_{36}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{57}} \underbrace{\frac{\partial w_{57}}{\partial u}}_{1}$$

$$\frac{\partial z_7}{\partial v} = \frac{\partial z_7}{\partial w_{23}} \underbrace{\frac{\partial w_{23}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{35}} \underbrace{\frac{\partial w_{35}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{46}} \underbrace{\frac{\partial w_{46}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{67}} \underbrace{\frac{\partial w_{67}}{\partial v}}_{1}$$

In other words, the gradient w.r.t. a shared parameter is the sum of the gradients w.r.t. each weight it generates.
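A minimal sketch of this behavior, assuming PyTorch is available: when a parameter is used in several places, autograd accumulates (sums) the gradient contributions from each use, as in the equations above. The tiny expression here is our own illustration, not from the lecture.

```python
import torch

u = torch.tensor(0.5, requires_grad=True)    # shared parameter
x1, x2 = torch.tensor(1.0), torch.tensor(2.0)

# u is used as the weight of two different connections
z = torch.tanh(u * x1) + torch.tanh(u * x2)
z.backward()

# Autograd sums the contribution of every use of u
manual = (1 - torch.tanh(u * x1) ** 2) * x1 + (1 - torch.tanh(u * x2) ** 2) * x2
assert torch.allclose(u.grad, manual)
```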
Automatic Differentiation

Automatic differentiation:
▶ Generates the backpropagation equations automatically from the forward equations.

▶ Automatic differentiation became widely available in neural network libraries (PyTorch, TensorFlow, JAX, etc.).

Consequences:
▶ In practice we do not need to implement backpropagation anymore. We just need to program the forward pass, and the backward pass comes for free, as illustrated below.

▶ This has enabled researchers to develop neural networks that are far more complex, and with much more heterogeneous structures (e.g. ResNet, YOLO, transformers, etc.).

▶ Only in a few cases is it still useful to express the gradient analytically (e.g. to theoretically analyze the stability of a gradient descent procedure, such as the vanishing/exploding gradients problem in recurrent neural networks).
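A minimal sketch, assuming PyTorch: only the forward pass of a small (illustrative) two-layer network with the same hinge-type error as above is written; the backward pass is generated by autograd.

```python
import torch

# Parameters of a tiny two-layer network (sizes are arbitrary)
W1 = torch.randn(2, 3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)

x = torch.randn(2)                  # input
t = torch.tensor(1.0)               # target in {-1, +1}

# Forward pass only; no backward equations are written by hand
z = torch.tanh(x @ W1) @ W2
E = torch.clamp(-z.squeeze() * t, min=0.0)   # E = max(0, -z t)

E.backward()                        # autograd generates the backward pass
print(W1.grad, W2.grad)             # dE/dW1 and dE/dW2
```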
Simple Algorithm for Training a Neural Network

Basic gradient descent algorithm:

    Initialize the vector of parameters θ at random.
    for t = 1 ... T do
        Compute the forward pass for all data points
        Compute the error function E(θ)
        Extract the gradient ∇E(θ) using backpropagation
        Perform a gradient step, i.e.

            θ ← θ − γ · ∇E(θ)

    end for

▶ The parameter γ is a learning rate that needs to be set by the user.
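A PyTorch sketch of this loop on toy data (the architecture, data, γ and T are arbitrary illustrative choices):

```python
import torch

# Toy data: N points in 2D with labels in {-1, +1}
X = torch.randn(100, 2)
t = torch.sign(X[:, 0] + X[:, 1])

# Initialize parameters at random
W1 = torch.randn(2, 10, requires_grad=True)
W2 = torch.randn(10, 1, requires_grad=True)
gamma = 0.1                                   # learning rate

for step in range(100):                       # t = 1 ... T
    z = torch.tanh(X @ W1) @ W2               # forward pass for all data points
    E = torch.clamp(-z.squeeze() * t, min=0.0).mean()   # error function E(theta)
    E.backward()                              # gradient via backpropagation
    with torch.no_grad():                     # gradient step: theta <- theta - gamma * grad
        W1 -= gamma * W1.grad
        W2 -= gamma * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```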
Part 5: Neural Network at Work
Neural Network at Work
[Figure: decision boundary of a neural network on a 2D toy dataset at iterations 0, 1, 3, 7, 15 and 31.]
Observation:
▶ After enough iterations, all points are on the correct side of the decision boundary (as with the perceptron, but here even when the data is not linearly separable).
Neural Network at Work

[Figure: decision boundary of the neural network at iteration 31.]

... still not very fast, and not an optimal decision boundary.

Considerations for the next lectures:

Optimization (Lectures 3-4)
▶ How to make training faster? (Especially important if we consider large problems with many input variables.)

Regularization (Lectures 5-6)
▶ The decision function doesn't look nice, and is unlikely to work well for new data points. Can we introduce a mechanism in the learning procedure that promotes more regular and well-generalizing decision functions?
Summary

▶ The error of a classifier can be minimized using gradient descent (e.g. perceptron, neural network + backpropagation).

▶ Error backpropagation is a computationally efficient way of computing the gradient (much faster than using the limit formulation of the derivative).

▶ Error backpropagation is an application of the multivariate chain rule, where the different terms can be factored due to the structure of the neural network graph.

▶ In practice, most of the time we do not need to program error backpropagation manually; we can instead use the automatic differentiation techniques available in most modern neural network libraries.

▶ Error backpropagation only extracts the gradient. This is not enough for training a neural network successfully: one still needs to make sure the gradient descent is carried out efficiently (Lectures 3-4) and that the learned model is robust enough to generalize well to new data points (Lectures 5-6).
