
Learning From Data

10: Neural Networks – I


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Wissen durch Praxis stärkt ("knowledge through practice strengthens")


Content

History

Definition

Backpropagation

The Four Backpropagation Equations

Bibliography



Practicing is not only playing your instrument, either by yourself
or rehearsing with others - it also includes imagining yourself
practicing. Your brain forms the same neural connections and
muscle memory whether you are imagining the task or actually
doing it.
– Yo-Yo Ma
Everybody right now, they look at the current technology, and
they think, ’OK, that’s what artificial neural nets are.’ And they
don’t realize how arbitrary it is. We just made it up! And there’s
no reason why we shouldn’t make up something else.
– Geoffrey Hinton



Neural Networks and Neurons

Neural Networks were inspired by biology


Neurons are connected to many other neurons
Neurons process information using action potentials
The brain is capable of processing complex computing tasks using a
network of neurons
For a good exposition of the biological background, see the book by
Raúl Rojas [Roj96].
Shallow Analogy
Be aware that the analogy is shallow; see [Lic15].



Neural Networks – General Architecture

[Figure: fully connected feed-forward network with an input layer (x1, ..., x4), two hidden layers, and an output layer (y1, y2, y3); the edges are labeled with weights w11, w12, w21, w22, ...]



Perceptron – Recap

The perceptron is a neural network with just one layer (a single node). It can
compute the logical functions:
AND
OR
However, it cannot compute the XOR function.
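To make this concrete, here is a minimal sketch (in Python/NumPy; the weights are hand-picked for illustration and are not part of the slides) of a perceptron computing AND:

```python
import numpy as np

def perceptron(x, w, b):
    """Single threshold unit: output 1 if w . x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

# Hand-picked weights realize AND; as shown next, no choice of w, b realizes XOR.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # -> 0, 0, 0, 1
```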



Perceptron – XOR

[Figure: the four XOR points in the plane; the two classes (o and +) are not linearly separable.]

The XOR truth table is

x  y  class
0  0  0
0  1  1
1  0  1
1  1  0

A perceptron with weights w1, w2 and bias w0 would therefore have to satisfy

0·w1 + 0·w2 + w0 ≤ 0    (1)
0·w1 + 1·w2 + w0 > 0    (2)
1·w1 + 0·w2 + w0 > 0    (3)
1·w1 + 1·w2 + w0 ≤ 0    (4)

However, these inequalities contradict each other: adding (2) and (3) gives w1 + w2 + 2w0 > 0, while adding (1) and (4) gives w1 + w2 + 2w0 ≤ 0.


Excursion: Neural Networks as a General Computing
Paradigm
The theory of computation has a long history:
John von Neumann defined (in "First Draft of a Report on the EDVAC", 1945) the von Neumann architecture we still use today for our machines.
He also introduced a new computational model which he called cellular automata (in the 1940s, together with Stanisław Ulam).
However, already in the 1930s Alonzo Church had introduced another model of computation, the so-called recursive functions and the lambda calculus. It is still at the heart of functional languages.
The field of neural networks started with researchers like McCulloch, Wiener, and von Neumann (1950s). Neural networks can be thought of as alternative building blocks for computation.



History

Backpropagation was introduced in the early 1960s; see the historical survey in [Sch14].
However, it went largely unappreciated until the 1986 paper [RHW86] by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
In this paper, several neural networks were described for which backpropagation works far faster than earlier learning approaches.
Today, backpropagation is the standard methodology to train neural networks.



Computing Activations
Let us define w^l_{jk} as the weight for the connection from the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer:

[Figure: feed-forward network (input layer x1, ..., x4, two hidden layers, output layer y1, y2, y3); one edge is labeled with the weight w^2_{53} as an example of this notation.]



Computing Activations (cont.)
Let us define b^l_j as the bias of the j-th neuron in the l-th layer.
Let us define a^l_j as the activation (output) of the j-th neuron in the l-th layer.

[Figure: the same network (input x1, ..., x4, two hidden layers, outputs y1, y2, y3); the neurons of hidden layer 1 are labeled with their biases b^1_j and the neurons of hidden layer 2 with their activations a^2_1, ..., a^2_5.]



Computing Activations (cont.)
The activation a^l_j of the j-th neuron in the l-th layer can be computed from the activations¹ in the (l−1)-th layer:

a^l_j = σ( Σ_k w^l_{jk} a^{l−1}_k + b^l_j )    (5)

This can be compactly written as a matrix equation:

a^l = σ(w^l a^{l−1} + b^l)    (6)

by introducing a matrix w^l of weights and the bias and activation vectors b^l and a^l, respectively.
Note that we implicitly use a vectorized version of σ, i.e. σ(u) := (σ(u_1), ..., σ(u_n)).

¹ We define activation functions in the sequel – just assume any function for now.
Computing Activations (cont.)

We also define the weighted input z^l of the l-th layer as

z^l = w^l a^{l−1} + b^l    (7)

Hence
a^l = σ(z^l)    (8)
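As a minimal illustration (a sketch, not from the slides), a vectorized forward pass over all layers might look like this in Python/NumPy, assuming lists `weights` and `biases` holding w^l and b^l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases, sigma=sigmoid):
    """Compute all weighted inputs z^l and activations a^l, cf. Eqs. (7) and (8)."""
    a = x
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b          # z^l = w^l a^{l-1} + b^l
        a = sigma(z)           # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```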



Learning Strategy
Strategy:
We would like to optimize the cost function C by choosing the right weights w.
In order to do so, we compute the partial derivatives ∂C/∂w and ∂C/∂b with respect to any weight w and any bias b in the network.
Then we can iteratively correct the weights by

w ← w − η ∂C/∂w

and the biases by

b ← b − η ∂C/∂b,

where η ∈ R+ is the so-called learning rate.
Note that this is a version of (stochastic) gradient descent optimization.
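A minimal sketch (illustrative only) of one such gradient descent update, assuming the gradients have already been computed and stored layer by layer:

```python
def gradient_descent_step(weights, biases, grad_w, grad_b, eta=0.1):
    """Update all weights and biases in place: w <- w - eta * dC/dw, b <- b - eta * dC/db."""
    for l in range(len(weights)):
        weights[l] -= eta * grad_w[l]
        biases[l] -= eta * grad_b[l]
    return weights, biases
```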
Brute Force

We can perform the optimization by a brute force approach:

Compute ∇C numerically.
This can easily be done as

∂C/∂w^l_i = lim_{ε→0} ( C(w^l_i + ε) − C(w^l_i) ) / ε

However, for a million weights this becomes prohibitive: each partial derivative requires a separate evaluation of C over the whole network.

Thus, we need a better approach!
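For illustration (a sketch, not part of the slides), the brute-force idea in Python/NumPy, with a hypothetical cost function `cost(weights)` and the weights flattened into a single array:

```python
import numpy as np

def numerical_gradient(cost, weights, eps=1e-6):
    """Approximate dC/dw_i by finite differences - one full cost evaluation per weight."""
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        w_plus = weights.copy()
        w_plus.flat[i] += eps
        grad.flat[i] = (cost(w_plus) - cost(weights)) / eps
    return grad
```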



Cost Functions
Backpropagation works for any cost function C satisfying these assumptions:
1. C is differentiable.
2. The cost function can be written as an average over cost functions of individual training examples – this is important for (stochastic) gradient descent.
3. The cost function can be written as a function of the neural network outputs.

The standard cost function for regression fulfills 2 and 3 of the above:

C = (1/2) ||y − a^L||².

If the activation function σ is differentiable, C is differentiable, too, and hence fulfills 1 above.
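A small sketch (assuming the per-example quadratic cost above) of this cost and of its gradient with respect to the output activations, which will later be needed as ∇_a C in backpropagation:

```python
import numpy as np

def quadratic_cost(a_L, y):
    """C = 0.5 * ||y - a^L||^2 for a single training example."""
    return 0.5 * np.sum((a_L - y) ** 2)

def quadratic_cost_grad(a_L, y):
    """dC/da^L_j = a^L_j - y_j, i.e. nabla_a C = a^L - y."""
    return a_L - y
```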
Activation Functions

For σ, the following functions are common:

1. The step function is defined as

   σ_Θ(s) := Θ(s) := { 0 if s ≤ 0; 1 if s > 0 }

   with s := wa + b. Note that the step function is not differentiable at zero.

2. The identity or linear function:

   σ_Id(s) = s


Activation Functions (cont.)
3. The sigmoid function is defined as

   σ_sigmoid(s) := sigmoid(s) := e^s / (e^s + 1) = 1 / (1 + e^{−s})

   with s := wa + b. Note that σ_sigmoid(s) ∈ (0, 1) and that the sigmoid is differentiable. A small change in weights or bias produces a small change in the output. Towards either end of the sigmoid, however, changes in s produce only a very small change in σ(s); this is the source of the vanishing gradients problem.

4. The tanh function is defined as

   σ_tanh(s) := tanh(s) = 2 / (1 + e^{−2s}) − 1 = 2 sigmoid(2s) − 1

   with s := wa + b. Note that tanh(s) ∈ (−1, 1) and that tanh is differentiable. Although its gradient is steeper than that of the sigmoid, tanh also suffers from vanishing gradients.



Activation Functions (cont.)

5. As an alternative to the tanh function, sometimes the arctan function is used:

   σ_arctan(s) := arctan(s) := tan^{−1}(s)

   with s := wa + b.

6. Another common (non-differentiable) activation function is the so-called REctified Linear Unit (RELU):

   σ_RELU(s) := { 0 if s ≤ 0; s if s > 0 }

   with s := wa + b.



Activation Functions (cont.)

[Figure: plots of the activation functions step(x), id(x), sigmoid(x), tanh(x), atan(x), and relu(x).]
Partial Derivatives of Activation Functions
Note that we need the partial derivatives of the activation functions for the backpropagation algorithm. Fortunately, these are easy to compute:

∂σ_Θ(s)/∂s = 0  for s ≠ 0    (9)
∂σ_Id(s)/∂s = 1    (10)
∂σ_sigmoid(s)/∂s = e^{−s} / (1 + e^{−s})² = σ_sigmoid(s) (1 − σ_sigmoid(s))    (11)
∂σ_tanh(s)/∂s = 1 − σ_tanh(s)²    (12)
∂σ_arctan(s)/∂s = 1 / (1 + s²)    (13)
∂σ_RELU(s)/∂s = { 0 if s < 0; 1 if s > 0 }    (14)
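A small sketch (illustrative, mirroring Eqs. (11)–(14)) of the differentiable activations and their derivatives in Python/NumPy:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):                    # Eq. (11)
    return sigmoid(s) * (1.0 - sigmoid(s))

def tanh_prime(s):                       # Eq. (12)
    return 1.0 - np.tanh(s) ** 2

def arctan_prime(s):                     # Eq. (13)
    return 1.0 / (1.0 + s ** 2)

def relu(s):
    return np.maximum(0.0, s)

def relu_prime(s):                       # Eq. (14); the value at s = 0 is a convention
    return (s > 0).astype(float)
```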



Non-Differentiable Activation Functions

Note that some activation functions are not differentiable, at least at certain points.
The theory below, including optimization using (stochastic) gradient descent, formally requires differentiability at all points.
In practice this is usually not a problem, as the functions are differentiable except at isolated points.
Thus, one usually does not "hit" these points, and/or sub-derivatives can be used to prove theorems and to ensure convergence and error estimates.



Errors

Definition 1
We define the error of the j-th neuron in layer l with respect to the weighted input z^l_j as

δ^l_j := ∂C/∂z^l_j.



First Fundamental Equation of Backpropagation
Lemma 2 (Error in the output layer)
For the output layer, the following is true:

δ^L_j = (∂C/∂a^L_j) (∂σ(z^L_j)/∂z^L_j) = (∂C/∂a^L_j) σ'(z^L_j)    (15)

Note that
∂C/∂a^L_j measures how fast the cost function changes as a function of the j-th output – i.e. if C does not depend much on neuron j, δ^L_j will be small.
σ'(z^L_j) measures how fast the activation function changes (its steepness) at z^L_j.

For the quadratic cost function C = (1/2) ||y − a^L||² we have

∂C/∂a^L_j = (a^L_j − y_j).


First Fundamental Equation of Backpropagation (cont.)

We often write the First Fundamental Equation of Backpropagation in matrix form as follows:

δ^L = ∇_a C ⊙ σ'(z^L),

where ⊙ denotes the so-called Hadamard or Schur product of two vectors:

Definition 3
The Hadamard or Schur product of two vectors a and b is defined as the point-wise product:

(a ⊙ b)_i := a_i b_i.
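As a quick illustration (a sketch using NumPy's elementwise semantics), the Hadamard product is just elementwise multiplication:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)   # Hadamard product: [ 4. 10. 18.]
```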



First Fundamental Equation of Backpropagation: Proof
Proof.
By definition,

δ^L_j := ∂C/∂z^L_j.

Hence, applying the chain rule,

δ^L_j = (∂C/∂a^L_j) (∂a^L_j/∂z^L_j).

Now, a^L_j = σ(z^L_j), thus

δ^L_j = (∂C/∂a^L_j) σ'(z^L_j).


Second Fundamental Equation of Backpropagation

Lemma 4 (Error Recursion)
The error in the l-th layer can be computed from the error in the (l+1)-th layer as follows:

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).    (16)

Suppose δ^{l+1} is known.
Applying the matrix (w^{l+1})^T to it moves the error one layer backward through the network; the result can be interpreted as a measure of the error at the output of the l-th layer.
By taking the Hadamard product with σ'(z^l) we move the error backward through the activation function in layer l.
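In code, one step of this recursion is essentially a single line (an illustrative sketch with made-up layer sizes, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))   # w^{l+1}: 3 neurons in layer l+1, 4 in layer l
delta_next = rng.standard_normal(3)    # delta^{l+1}
z_l = rng.standard_normal(4)           # z^l

# Eq. (16): delta^l = ((w^{l+1})^T delta^{l+1}) Hadamard sigma'(z^l)
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)                   # (4,)
```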



The First Two Fundamental Equations of Backpropagation

The first two fundamental equations of backpropagation enable a recursion:

δ^L = ∇_a C ⊙ σ'(z^L)    (17)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).    (18)

First, compute δ^L.
Second, compute δ^{L−1}.
Third, compute δ^{L−2}.
...


Second Fundamental Equation of Backpropagation – Proof
Proof.
By definition, δ^l_j := ∂C/∂z^l_j. Applying the chain rule and the definition of δ^{l+1}_k, we get

δ^l_j = Σ_k (∂C/∂z^{l+1}_k) (∂z^{l+1}_k/∂z^l_j) = Σ_k (∂z^{l+1}_k/∂z^l_j) δ^{l+1}_k.

Differentiating the z variables on the other hand yields (using z^{l+1}_k = Σ_j w^{l+1}_{kj} σ(z^l_j) + b^{l+1}_k):

∂z^{l+1}_k/∂z^l_j = w^{l+1}_{kj} σ'(z^l_j).

Putting these together we get

δ^l_j = Σ_k w^{l+1}_{kj} δ^{l+1}_k σ'(z^l_j),   i.e.   δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).


Third Fundamental Equation of Backpropagation
We can also easily compute the rate of change of the cost C w.r.t. any bias in the network:

Lemma 5 (Bias)

∂C/∂b^l_j = δ^l_j   ⟺   ∇_{b^l} C = δ^l.    (19)

Proof.

∂C/∂b^l_j = (∂C/∂z^l_j) (∂z^l_j/∂b^l_j) = δ^l_j · 1 = δ^l_j.



Fourth Fundamental Equation of Backpropagation
Lemma 6 (Rate of change of the cost)
The rate of change of the cost w.r.t. any weight is

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j   ⟺   ∇_{w^l} C = vec(δ^l ⊗ a^{l−1}).    (20)

Proof.

∂C/∂w^l_{jk} = (∂C/∂z^l_j) (∂z^l_j/∂w^l_{jk}) = (∂C/∂z^l_j) a^{l−1}_k = δ^l_j a^{l−1}_k.

Since for any a = (a_1, ..., a_n), b = (b_1, ..., b_m) we have a ⊗ b = (a_i b_j), and for any n × m matrix A we have vec(A) = (a_{1,1}, ..., a_{n,1}, a_{1,2}, ..., a_{n,2}, ..., a_{1,m}, ..., a_{n,m}), the second equation is equivalent to the first.
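A short sketch (illustrative values, not from the slides) showing that the weight gradient in Eq. (20) is just the outer product of the error vector and the previous layer's activations:

```python
import numpy as np

delta_l = np.array([0.1, -0.2, 0.05])        # delta^l (3 neurons in layer l)
a_prev = np.array([1.0, 0.5, 0.0, 2.0])      # a^{l-1} (4 neurons in layer l-1)

grad_W = np.outer(delta_l, a_prev)           # dC/dw^l_{jk} = a^{l-1}_k * delta^l_j
print(grad_W.shape)                          # (3, 4), same shape as w^l
```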
Fourth Fundamental Equation of Backpropagation (cont.)

From

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j    (21)

we conclude:
If the activation a^{l−1}_k is small, so is the gradient ∂C/∂w^l_{jk}.
Thus, weights going out of low-activation neurons learn slowly.
σ becomes very flat for saturated neurons, so σ'(z^l_j) and hence δ^l_j are small.
Thus, weights feeding into saturated neurons learn slowly.


Summary
The four fundamental equations of backpropagation
For a neural network with

z^l = w^l a^{l−1} + b^l
a^l = σ(z^l)
δ^l_j = ∂C/∂z^l_j,

where C is any cost function satisfying the conditions defined above, the following relations hold:

δ^L = ∇_a C ⊙ σ'(z^L)    (22)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)    (23)
∇_{b^l} C = δ^l    (24)
∇_{w^l} C = vec(δ^l ⊗ a^{l−1}).    (25)
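To make the four equations concrete, here is a minimal sketch (in Python/NumPy, assuming the sigmoid activation and the quadratic cost from above; the layer sizes and variable names are illustrative and not from the slides) that computes all δ^l and the gradients for a single training example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (grad_w, grad_b) for one training example, following Eqs. (22)-(25)."""
    # Feedforward pass: store all weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    L = len(weights)
    grad_w, grad_b = [None] * L, [None] * L

    # Eq. (22): delta^L = nabla_a C Hadamard sigma'(z^L); quadratic cost gives nabla_a C = a^L - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                              # Eq. (24)
    grad_w[-1] = np.outer(delta, activations[-2])   # Eq. (25)

    # Eq. (23): propagate the error backward through the remaining layers.
    for l in range(L - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_b[l] = delta
        grad_w[l] = np.outer(delta, activations[l])
    return grad_w, grad_b

# Tiny usage example: 2 inputs, one hidden layer with 3 neurons, 1 output.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
grad_w, grad_b = backprop(np.array([0.5, -1.0]), np.array([1.0]), weights, biases)
```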



Backpropagation – Intuition

[Figure: a small computational graph for f_1(x) + f_2(x); each node carries its derivative next to its value (f_1'(x) | f_1(x), f_2'(x) | f_2(x)), and the backward arrow labeled "backpropagation" carries the derivative f_1'(x) + f_2'(x) of the sum node.]

Figure: Nothing but the chain rule



The Backpropagation Algorithm
Algorithm 1 Backpropagation(x)
1: Set the input activation a^1.
2: while break condition not met do
3:   for l ∈ [2, ..., L] do
4:     Feedforward pass: compute z^l = w^l a^{l−1} + b^l and a^l = σ(z^l)
5:   end for
6:   Output error: compute δ^L = ∇_a C ⊙ σ'(z^L)
7:   for l ∈ [L − 1, ..., 2] do
8:     Backpropagation pass: backpropagate the error using
9:       δ^L = ∇_a C ⊙ σ'(z^L) and
10:      δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)
11:    Update: apply gradient descent by updating the weights
12:      and biases as w^l ← w^l − η ∇_{w^l} C and b^l ← b^l − η ∇_{b^l} C.
13:  end for
14: end while
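A compact sketch of this training loop (illustrative only; it reuses the hypothetical `backprop` and `gradient_descent_step` helpers from the earlier examples and uses a fixed iteration limit as its break condition):

```python
def train(weights, biases, data, eta=0.1, max_iter=1000):
    """Plain gradient descent over single examples; break condition: iteration limit."""
    for _ in range(max_iter):
        for x, y in data:
            grad_w, grad_b = backprop(x, y, weights, biases)
            gradient_descent_step(weights, biases, grad_w, grad_b, eta)
    return weights, biases
```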
The Backpropagation Algorithm (cont.)

For the break condition one usually uses one of the following:

A fixed limit on the number of iterations
A limit on the changes of the weights
A limit on the changes of the errors
A limit based on cross-validation (expensive)



Backpropagation Efficiency

Backpropagation is much more efficient than brute force, as it requires only two passes (forward and backward) per iteration instead of one cost evaluation per weight.
Being able to quickly calculate the derivatives of the activation functions is important.

Note, however, that no method can guarantee to find the global minimum – thus we risk getting stuck in a local minimum.



References I

[Lic15] G. Licata, "Are neural networks imitations of mind?", Journal of Computer Science & Systems Biology, vol. 8, 2015.

[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0

[Roj96] R. Rojas, Neural Networks: A Systematic Introduction. Berlin, Heidelberg: Springer-Verlag, 1996.

[Sch14] J. Schmidhuber, "Deep learning in neural networks: An overview," CoRR, vol. abs/1404.7828, 2014.

