
WiSe 2023/24

Deep Learning 1

Lecture 2: Error Backpropagation


Outline

▶ The Perceptron algorithm

▶ Perceptron as gradient descent

▶ How to compute the gradient in a neural network

▶ Numerical gradient computation

▶ The error backpropagation algorithm


▶ Chain rule and multivariate chain rule
▶ Worked-through example
▶ General equation for backpropagation
▶ Vectorization of backpropagation
▶ Automatic differentiation

Part 1: The Perceptron
The Perceptron

▶ Algorithm proposed in 1958 by F. Rosenblatt (1928-1971) to train single-layer neural networks.

▶ The algorithm produces classifiers that perfectly separate the training data (if the data is linearly separable).

▶ The algorithm consists of a simple and cheap iterative procedure.
The Perceptron

Structure of the perceptron:

▶ A weighted sum of the input features x1, ..., xd with weights w1, ..., wd:

$$z = \sum_{i=1}^{d} w_i x_i + b = w^\top x + b$$

followed by the sign function

$$y = \mathrm{sign}(z)$$

Problem formulation: Let

$$x^{(1)}, x^{(2)}, \dots, x^{(N)} \in \mathbb{R}^d$$

be our input data and $t^{(1)}, t^{(2)}, \dots, t^{(N)} \in \{-1, 1\}$ our corresponding labels (aka targets). The goal of the perceptron is to learn a collection of parameters $(w, b)$ such that all points are correctly classified, i.e.

$$\forall_{k=1}^{N}: \; y^{(k)} = t^{(k)}$$
The Perceptron Algorithm

Recall that for each data point, the predictions of our perceptron are computed as:

$$z^{(k)} = w^\top x^{(k)} + b, \qquad y^{(k)} = \mathrm{sign}(z^{(k)})$$

Perceptron algorithm
▶ Iterate (multiple times) over the data points k = 1 ... N.
▶ If $x^{(k)}$ is correctly classified ($y^{(k)} = t^{(k)}$), continue.
▶ If $x^{(k)}$ is wrongly classified ($y^{(k)} \neq t^{(k)}$), update the perceptron:

$$w \leftarrow w + \eta \cdot x^{(k)} t^{(k)}, \qquad b \leftarrow b + \eta \cdot t^{(k)}$$

where $\eta$ is a learning rate.

▶ Stop once all examples are correctly classified.
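As a concrete illustration, here is a minimal NumPy sketch of this procedure (function and variable names are our own, not from the lecture):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron algorithm: X is (N, d), t is (N,) with entries in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for k in range(N):
            y = np.sign(w @ X[k] + b)       # prediction for data point k
            if y != t[k]:                   # misclassified: apply the update
                w += eta * X[k] * t[k]
                b += eta * t[k]
                errors += 1
        if errors == 0:                     # all examples correctly classified
            break
    return w, b
```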

Perceptron at Work

[Figure: decision boundary of the perceptron on a 2D toy dataset at iterations 0, 1, 3, 7, 15 and 31.]
The Perceptron: Optimization View

Proposition
The perceptron can be seen as a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w,b)}$$

Also known as the Hinge Loss.

Proof.

$$w - \eta \frac{\partial E_k}{\partial w} = w - \eta \cdot 1_{-z^{(k)} t^{(k)} > 0} \cdot \left(-t^{(k)}\right) \frac{\partial z^{(k)}}{\partial w} = w - \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot \left(-t^{(k)}\right) \frac{\partial z^{(k)}}{\partial w} = w + \eta \cdot 1_{y^{(k)} \neq t^{(k)}} \cdot x^{(k)} t^{(k)}$$

which is the parameter update equation of the perceptron algorithm. We proceed similarly for the parameter b.
Perceptron vs. Nearest Mean Classifier

The nearest mean classifier first builds the means of the two classes:

$$\mu_1 = \frac{1}{|C_1|} \sum_{k \in C_1} x^{(k)} \qquad \text{and} \qquad \mu_2 = \frac{1}{|C_2|} \sum_{k \in C_2} x^{(k)}$$

Then it predicts:

$$\text{class 1 if } \|x - \mu_1\| < \|x - \mu_2\|, \qquad \text{class 2 if } \|x - \mu_1\| > \|x - \mu_2\|$$
Perceptron vs. Nearest Mean Classifier

Equivalent formulation of the nearest mean classifier:

$$\text{class 1 if } \|x - \mu_1\|^2 < \|x - \mu_2\|^2, \qquad \text{class 2 if } \|x - \mu_1\|^2 > \|x - \mu_2\|^2$$

(raising distances to the square does not change the decision).

The nearest mean classifier can then be further developed to become expressible as a linear classifier:

$$
\begin{aligned}
y &= \mathrm{sign}(\|x - \mu_2\|^2 - \|x - \mu_1\|^2) \\
  &= \mathrm{sign}(\|x\|^2 - 2\mu_2^\top x + \|\mu_2\|^2 - \|x\|^2 + 2\mu_1^\top x - \|\mu_1\|^2) \\
  &= \mathrm{sign}\big(\underbrace{2(\mu_1 - \mu_2)}_{w}{}^{\!\top} x + \underbrace{\|\mu_2\|^2 - \|\mu_1\|^2}_{b}\big)
\end{aligned}
$$

where y = 1 corresponds to class 1 and y = −1 corresponds to class 2.
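A minimal NumPy sketch of this derivation (the synthetic data and names are illustrative): it builds the class means, forms w and b as above, and verifies that the linear form agrees with the squared-distance comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))    # class 1 samples
X2 = rng.normal(loc=[+2.0, 0.0], size=(50, 2))    # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)       # class means

# Linear-classifier form derived above
w = 2 * (mu1 - mu2)
b = np.dot(mu2, mu2) - np.dot(mu1, mu1)

# The linear form agrees with the squared-distance comparison
x = rng.normal(size=2)                            # arbitrary test point
lhs = w @ x + b
rhs = np.linalg.norm(x - mu2) ** 2 - np.linalg.norm(x - mu1) ** 2
assert np.isclose(lhs, rhs)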
Perceptron vs. Nearest Mean Classifier

Question:
▶ Both the perceptron and the nearest mean classifier are linear classifiers. How, then, do the perceptron and the nearest mean classifier differ?

Observation
▶ The perceptron always separates the data when it is linearly separable. This is not always the case for the nearest mean classifier, especially when the data of each class is elongated.

The nearest mean classifier describes the data; the perceptron minimizes the error.
Part 2: Training Multilayer Networks
From Perceptrons to Training Deep Networks
Recap: The perceptron performs a gradient descent of the error function

$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z^{(k)} t^{(k)})}_{E_k(w,b)}$$

Idea: Generalize the formulation so that z is not the output of the perceptron, but of any multilayer neural network.

Question: How do we compute ∂E/∂θ, i.e. the gradient of the newly defined error function w.r.t. the model parameters?
Naive Approach: Numerical Differentiation

Formula for numerical differentiation:

$$\forall t: \quad \frac{\partial E}{\partial \theta_t} = \lim_{\epsilon \to 0} \frac{E(\theta + \epsilon \cdot \delta_t) - E(\theta)}{\epsilon}$$

where δt is an indicator vector for the parameter t.

Properties:
▶ Can be applied to any error function E (not necessarily the error of a neural network).

▶ Needs to evaluate the function as many times as there are parameters (→ slow when the number of parameters is large).

▶ A neural network typically has between 10³ and 10⁹ parameters (→ numerical differentiation is infeasible for training).

▶ Still useful as a unit test for verifying gradient computations, as sketched below.

▶ Because ϵ and the numerator are very small, numerical differentiation only works with high-precision arithmetic (e.g. float64 rather than float32).
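A minimal sketch of such a gradient-check routine (names are illustrative; a small but finite ϵ replaces the limit):

```python
import numpy as np

def numerical_gradient(E, theta, eps=1e-6):
    """Approximate dE/dtheta by finite differences, one parameter at a time."""
    theta = theta.astype(np.float64)        # high precision is essential here
    grad = np.zeros_like(theta)
    for t in range(theta.size):
        delta = np.zeros_like(theta)
        delta[t] = 1.0                      # indicator vector for parameter t
        grad[t] = (E(theta + eps * delta) - E(theta)) / eps
    return grad

# Usage: check against a known gradient on a toy quadratic error
E = lambda th: 0.5 * np.sum(th ** 2)        # E(theta) = 0.5 * ||theta||^2
theta = np.array([1.0, -2.0, 3.0])
assert np.allclose(numerical_gradient(E, theta), theta, atol=1e-4)  # dE/dtheta = theta
```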
Better Approach: The Chain Rule

Suppose that some parameter of interest θq (one element of the parameter vector θ) is linked to the output of the network through some sequence of functions:

[Diagram: θq → a (layer 1) → b (layer 2) → z]

The chain rule for derivatives states that

$$\frac{\partial z}{\partial \theta_q} = \frac{\partial z}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial \theta_q}$$

i.e. the derivative w.r.t. the parameter of interest is the product of the local derivatives along the path connecting θq to z.
The Multivariate Chain Rule

In practice, a parameter of interest may be linked to the output of the network through multiple paths (formed by all neurons between them):

[Diagram: θq → a1, a2, ... (layer 1) → b1, b2, ... (layer 2) → z]

The chain rule can be extended to this multivariate scenario by enumerating all the paths between θq and z:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \sum_j \frac{\partial z}{\partial b_j} \frac{\partial b_j}{\partial a_i} \frac{\partial a_i}{\partial \theta_q}$$

where the sums over i and j run over the indices of the neurons in the corresponding layers. The complexity of this naive enumeration grows exponentially with the number of layers.
Factor Structure in the Multivariate Chain Rule

[Diagram: the same two-layer network, θq → (a1, a2, ...) → (b1, b2, ...) → z.]

▶ The computation can be rewritten so that the summing operations are performed incrementally:

$$\frac{\partial z}{\partial \theta_q} = \sum_i \frac{\partial a_i}{\partial \theta_q} \underbrace{\sum_j \frac{\partial b_j}{\partial a_i} \underbrace{\frac{\partial z}{\partial b_j}}_{\delta_j}}_{\delta_i}$$

▶ Intermediate computations can be reused for different paths, and for different parameters for which we would like to compute the gradient.

▶ Overall, the resulting gradient computation (w.r.t. all parameters in the network) becomes linear in the size of the network (⇒ fast!).

▶ The algorithm is known as the Error Backpropagation algorithm (Rumelhart, 1986).
Part 3: Worked-Through Example and Formalization
Worked-Through Example

Consider a small network with inputs a1, a2, hidden neurons a3, a4 (first layer) and a5, a6 (second layer), output z7, and weights w13, w23, w24, w35, w36, w46, w57, w67 connecting them.

Forward pass:

$$
\begin{aligned}
a_1 &= x_1 \\
a_2 &= x_2 \\
z_3 &= a_1 w_{13} + a_2 w_{23} & a_3 &= g(z_3) \\
z_4 &= a_2 w_{24} & a_4 &= g(z_4) \\
z_5 &= a_3 w_{35} & a_5 &= g(z_5) \\
z_6 &= a_3 w_{36} + a_4 w_{46} & a_6 &= g(z_6) \\
z_7 &= a_5 w_{57} + a_6 w_{67} \\
E &= \max(0, -z_7 t)
\end{aligned}
$$

Backward pass: starting at the output, we first compute the error signal

$$\delta_7 = \frac{\partial E}{\partial z_7} = 1_{-z_7 t > 0} \cdot (-t)$$

from which the gradients of the last layer's weights follow:

$$\frac{\partial E}{\partial w_{67}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{67}} = \delta_7 \cdot a_6
\qquad
\frac{\partial E}{\partial w_{57}} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial w_{57}} = \delta_7 \cdot a_5$$

We then propagate the error signal to the second hidden layer:

$$\delta_6 = \frac{\partial E}{\partial a_6} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_6} = \delta_7 \cdot w_{67}
\qquad
\delta_5 = \frac{\partial E}{\partial a_5} = \frac{\partial E}{\partial z_7} \frac{\partial z_7}{\partial a_5} = \delta_7 \cdot w_{57}$$

and extract the gradients of that layer's weights:

$$\frac{\partial E}{\partial w_{46}} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial w_{46}} = \delta_6 \cdot g'(z_6) \cdot a_4
\qquad
\frac{\partial E}{\partial w_{36}} = \delta_6 \cdot g'(z_6) \cdot a_3
\qquad
\frac{\partial E}{\partial w_{35}} = \delta_5 \cdot g'(z_5) \cdot a_3$$

Propagating one layer further (note that a3 contributes to the output through two paths, via z5 and via z6):

$$\delta_4 = \frac{\partial E}{\partial a_4} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_4} = \delta_6 \cdot g'(z_6) \cdot w_{46}$$

$$\delta_3 = \frac{\partial E}{\partial a_3} = \frac{\partial E}{\partial a_6} \frac{\partial a_6}{\partial z_6} \frac{\partial z_6}{\partial a_3} + \frac{\partial E}{\partial a_5} \frac{\partial a_5}{\partial z_5} \frac{\partial z_5}{\partial a_3} = \delta_6 \cdot g'(z_6) \cdot w_{36} + \delta_5 \cdot g'(z_5) \cdot w_{35}$$

and finally we obtain the gradients of the first layer's weights:

$$\frac{\partial E}{\partial w_{24}} = \delta_4 \cdot g'(z_4) \cdot a_2
\qquad
\frac{\partial E}{\partial w_{23}} = \delta_3 \cdot g'(z_3) \cdot a_2
\qquad
\frac{\partial E}{\partial w_{13}} = \delta_3 \cdot g'(z_3) \cdot a_1$$
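To make this concrete, here is a minimal NumPy sketch of the same forward and backward pass. The lecture leaves g generic; tanh is our own illustrative choice, and the result is checked against the numerical-differentiation formula from the previous part.

```python
import numpy as np

def forward_backward(w, x, t):
    """Forward and backward pass of the worked-through example; w is a dict of weights."""
    # Forward pass
    a1, a2 = x
    z3 = a1 * w['w13'] + a2 * w['w23']; a3 = np.tanh(z3)
    z4 = a2 * w['w24'];                 a4 = np.tanh(z4)
    z5 = a3 * w['w35'];                 a5 = np.tanh(z5)
    z6 = a3 * w['w36'] + a4 * w['w46']; a6 = np.tanh(z6)
    z7 = a5 * w['w57'] + a6 * w['w67']
    E = max(0.0, -z7 * t)

    # Backward pass; for g = tanh, g'(z) = 1 - tanh(z)^2
    gp = lambda z: 1.0 - np.tanh(z) ** 2
    d7 = -t if -z7 * t > 0 else 0.0
    d6 = d7 * w['w67']
    d5 = d7 * w['w57']
    d4 = d6 * gp(z6) * w['w46']
    d3 = d6 * gp(z6) * w['w36'] + d5 * gp(z5) * w['w35']
    grad = {'w67': d7 * a6, 'w57': d7 * a5,
            'w46': d6 * gp(z6) * a4, 'w36': d6 * gp(z6) * a3,
            'w35': d5 * gp(z5) * a3,
            'w24': d4 * gp(z4) * a2, 'w23': d3 * gp(z3) * a2,
            'w13': d3 * gp(z3) * a1}
    return E, grad

# Unit test: compare one backpropagated gradient with a finite difference
names = ['w13', 'w23', 'w24', 'w35', 'w36', 'w46', 'w57', 'w67']
w = dict(zip(names, np.random.default_rng(0).normal(size=8)))
x, t, eps = (1.0, -2.0), 1.0, 1e-6
E, grad = forward_backward(w, x, t)
w2 = dict(w); w2['w13'] += eps
E2, _ = forward_backward(w2, x, t)
assert abs((E2 - E) / eps - grad['w13']) < 1e-4
```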
Formalization for a Standard Neural Network

The error gradient can be propagated from layer to layer using the chain rule:

$$\delta_j = \frac{\partial E}{\partial a_j} = \sum_k \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial a_j}}_{w_{jk}}$$

And the gradients w.r.t. the parameters at each layer can be extracted as:

$$\frac{\partial E}{\partial w_{jk}} = \underbrace{\frac{\partial E}{\partial a_k}}_{\delta_k} \cdot \underbrace{\frac{\partial a_k}{\partial z_k}}_{g'(z_k)} \cdot \underbrace{\frac{\partial z_k}{\partial w_{jk}}}_{a_j}$$
Matrix Formulation

Observation:
▶ The backpropagation equations can be written as matrix-vector products, or outer products:

Neuron-wise:

$$\delta_j = \sum_k \delta_k \, g'(z_k) \, w_{jk} \qquad\qquad \frac{\partial E}{\partial w_{jk}} = \delta_k \, g'(z_k) \, a_j$$

Layer-wise:

$$\delta^{(l-1)} = W^{(l-1,l)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big) \qquad\qquad \frac{\partial E}{\partial W^{(l-1,l)}} = a^{(l-1)} \cdot \big(g'(z^{(l)}) \odot \delta^{(l)}\big)^\top$$

where:
▶ j and k are indices for neurons at layers l − 1 and l respectively,
▶ ⊙ is an element-wise multiplication,
▶ g′(·) is the derivative of g applied element-wise,
▶ W^(l−1,l) is a matrix of size (d_(l−1) × d_l), where d_(l−1) and d_l indicate the number of neurons at layers l − 1 and l respectively.

Note:
▶ Further vectorization can be achieved by computing the gradient for multiple data points at once, in which case we have matrix-matrix products. A sketch of the layer-wise equations is given below.
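A minimal NumPy sketch of the layer-wise equations (layer sizes, names and the tanh nonlinearity are illustrative choices):

```python
import numpy as np

# Illustrative layer sizes: d_{l-1} = 4 neurons, d_l = 3 neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # W^(l-1,l), shape (d_{l-1}, d_l)
a_prev = rng.normal(size=4)        # a^(l-1), activations of layer l-1
z = a_prev @ W                     # z^(l), pre-activations of layer l
delta = rng.normal(size=3)         # delta^(l), error signal at layer l

g_prime = lambda z: 1.0 - np.tanh(z) ** 2      # example: g = tanh

# Layer-wise backpropagation equations
delta_prev = W @ (g_prime(z) * delta)          # delta^(l-1), shape (4,)
dE_dW = np.outer(a_prev, g_prime(z) * delta)   # dE/dW^(l-1,l), shape (4, 3)
```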
Part 4: Further Remarks / Advanced Topics
Choice of Nonlinear Activation Function

In practice, for training to proceed, the nonlinear activation function must be chosen such that:

1. Its gradient is defined (almost) everywhere.

2. There is a significant portion of the input domain where the gradient is non-zero.

3. The gradient is informative, i.e. it indicates whether the activation function decreases or increases.

Common activation functions:
▶ g(z) = exp(z)/(1 + exp(z))
▶ g(z) = tanh(z)
▶ g(z) = max(0, z)

Problematic activation functions:
▶ g(z) = max(0, z − 100)  (gradient is zero on most of the input domain)
▶ g(z) = 1_{z>0}  (gradient is zero wherever it is defined)
▶ g(z) = sin(100 · z)  (gradient is non-zero but uninformative)
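A small sketch of criterion 2 for two of these functions (the sampling range is an arbitrary assumption):

```python
import numpy as np

z = np.linspace(-5, 5, 10001)
relu_grad = (z > 0).astype(float)        # gradient of g(z) = max(0, z)
shifted_grad = (z > 100).astype(float)   # gradient of g(z) = max(0, z - 100)

print((relu_grad != 0).mean())      # ~0.5: non-zero on half the sampled domain
print((shifted_grad != 0).mean())   # 0.0: zero everywhere sampled
```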
Neural Networks with Shared Parameters

Shared parameters can be handled by treating the original parameters as neurons generated from the new parameters, and applying the chain rule one step further:

[Diagram: the network of the worked-through example, where the weights w13, w24, w36, w57 are all generated from a shared parameter u, and the weights w23, w35, w46, w67 from a shared parameter v.]

Chain rule equations:

$$\frac{\partial z_7}{\partial u} = \frac{\partial z_7}{\partial w_{13}} \underbrace{\frac{\partial w_{13}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{24}} \underbrace{\frac{\partial w_{24}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{36}} \underbrace{\frac{\partial w_{36}}{\partial u}}_{1} + \frac{\partial z_7}{\partial w_{57}} \underbrace{\frac{\partial w_{57}}{\partial u}}_{1}$$

$$\frac{\partial z_7}{\partial v} = \frac{\partial z_7}{\partial w_{23}} \underbrace{\frac{\partial w_{23}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{35}} \underbrace{\frac{\partial w_{35}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{46}} \underbrace{\frac{\partial w_{46}}{\partial v}}_{1} + \frac{\partial z_7}{\partial w_{67}} \underbrace{\frac{\partial w_{67}}{\partial v}}_{1}$$

In other words, the gradient w.r.t. a shared parameter is the sum of the gradients w.r.t. each weight it generates.
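A minimal sketch of this behavior, assuming PyTorch is available: when a parameter is used in several places, autograd accumulates (sums) the gradient contributions from each use, as in the equations above. The tiny expression here is our own illustration, not from the lecture.

```python
import torch

u = torch.tensor(0.5, requires_grad=True)    # shared parameter
x1, x2 = torch.tensor(1.0), torch.tensor(2.0)

# u is used as the weight of two different connections
z = torch.tanh(u * x1) + torch.tanh(u * x2)
z.backward()

# Autograd sums the contribution of every use of u
manual = (1 - torch.tanh(u * x1) ** 2) * x1 + (1 - torch.tanh(u * x2) ** 2) * x2
assert torch.allclose(u.grad, manual)
```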
Automatic Differentiation

Automatic differentiation:
▶ Generates the backpropagation equations automatically from the forward equations.

▶ Automatic differentiation became widely available in neural network libraries (PyTorch, TensorFlow, JAX, etc.).

Consequences:
▶ In practice we do not need to implement backpropagation anymore. We just need to program the forward pass, and the backward pass comes for free, as illustrated below.

▶ This has enabled researchers to develop neural networks that are far more complex, and with much more heterogeneous structures (e.g. ResNet, YOLO, transformers, etc.).

▶ Only in a few cases is it still useful to express the gradient analytically (e.g. to theoretically analyze the stability of a gradient descent procedure, such as the vanishing/exploding gradients problem in recurrent neural networks).
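A minimal sketch, assuming PyTorch: only the forward pass of a small (illustrative) two-layer network with the same hinge-type error as above is written; the backward pass is generated by autograd.

```python
import torch

# Parameters of a tiny two-layer network (sizes are arbitrary)
W1 = torch.randn(2, 3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)

x = torch.randn(2)                  # input
t = torch.tensor(1.0)               # target in {-1, +1}

# Forward pass only; no backward equations are written by hand
z = torch.tanh(x @ W1) @ W2
E = torch.clamp(-z.squeeze() * t, min=0.0)   # E = max(0, -z t)

E.backward()                        # autograd generates the backward pass
print(W1.grad, W2.grad)             # dE/dW1 and dE/dW2
```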
Simple Algorithm for Training a Neural Network

Basic gradient descent algorithm:

    Initialize the vector of parameters θ at random.
    for t = 1 ... T do
        Compute the forward pass for all data points
        Compute the error function E(θ)
        Extract the gradient ∇E(θ) using backpropagation
        Perform a gradient step, i.e.

            θ ← θ − γ · ∇E(θ)

    end for

▶ The parameter γ is a learning rate that needs to be set by the user.
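A PyTorch sketch of this loop on toy data (the architecture, data, γ and T are arbitrary illustrative choices):

```python
import torch

# Toy data: N points in 2D with labels in {-1, +1}
X = torch.randn(100, 2)
t = torch.sign(X[:, 0] + X[:, 1])

# Initialize parameters at random
W1 = torch.randn(2, 10, requires_grad=True)
W2 = torch.randn(10, 1, requires_grad=True)
gamma = 0.1                                   # learning rate

for step in range(100):                       # t = 1 ... T
    z = torch.tanh(X @ W1) @ W2               # forward pass for all data points
    E = torch.clamp(-z.squeeze() * t, min=0.0).mean()   # error function E(theta)
    E.backward()                              # gradient via backpropagation
    with torch.no_grad():                     # gradient step: theta <- theta - gamma * grad
        W1 -= gamma * W1.grad
        W2 -= gamma * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```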
Part 5: Neural Network at Work
Neural Network at Work
[Figure: decision boundary of a neural network on a 2D toy dataset at iterations 0, 1, 3, 7, 15 and 31.]
Observation:
▶ After enough iterations, all points are on the correct side of the decision boundary (as with the perceptron, but here even when the data is not linearly separable).
Neural Network at Work

[Figure: decision boundary of the neural network at iteration 31.]

... still not very fast, and not an optimal decision boundary.

Considerations for the next lectures:

Optimization (Lectures 3-4)
▶ How to make training faster? (Especially important if we consider large problems with many input variables.)

Regularization (Lectures 5-6)
▶ The decision function doesn't look nice, and is unlikely to work well for new data points. Can we introduce a mechanism in the learning procedure that promotes more regular and well-generalizing decision functions?
Summary

▶ The error of a classifier can be minimized using gradient descent (e.g. perceptron, neural network + backpropagation).

▶ Error backpropagation is a computationally efficient way of computing the gradient (much faster than using the limit formulation of the derivative).

▶ Error backpropagation is an application of the multivariate chain rule, where the different terms can be factored due to the structure of the neural network graph.

▶ In practice, most of the time we do not need to program error backpropagation manually; we can instead use the automatic differentiation techniques available in most modern neural network libraries.

▶ Error backpropagation only extracts the gradient. This is not enough for training a neural network successfully: one still needs to make sure the gradient descent is carried out efficiently (Lectures 3-4) and that the learned model is robust enough to generalize well to new data points (Lectures 5-6).
