
Learning From Data

10: Neural Networks – I


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Wissen durch Praxis stärkt ("knowledge through practice strengthens")


Content

History

Definition

Backpropagation

The Four Backpropagation Equations

Bibliography



Practicing is not only playing your instrument, either by yourself
or rehearsing with others - it also includes imagining yourself
practicing. Your brain forms the same neural connections and
muscle memory whether you are imagining the task or actually
doing it.
– Yo-Yo Ma
Everybody right now, they look at the current technology, and
they think, ’OK, that’s what artificial neural nets are.’ And they
don’t realize how arbitrary it is. We just made it up! And there’s
no reason why we shouldn’t make up something else.
– Geoffrey Hinton



Neural Networks and Neurons

Neural Networks were inspired by biology


Neurons are connected to many other neurons
Neurons process information using action potentials
The brain is capable of processing complex computing tasks using a
network of neurons
For a good exposition of the biological background, see the book by
Raúl Rojas [Roj96].
Shallow Analogy
Be aware that the analogy is shallow; see [Lic15].



Neural Networks – General Architecture

[Figure: fully connected feed-forward network with an input layer (x1, ..., x4), two hidden layers, and an output layer (y1, y2, y3); the edges are labeled with weights w11, w12, w21, w22, ...]



Perceptron – Recap

The perceptron is a neural network with just one layer (a single node). It can
compute the logical functions:
AND
OR
However, it cannot compute the XOR function.
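To make this concrete, here is a minimal sketch (in Python/NumPy; the weights are hand-picked for illustration and are not part of the slides) of a perceptron computing AND:

```python
import numpy as np

def perceptron(x, w, b):
    """Single threshold unit: output 1 if w . x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

# Hand-picked weights realize AND; as shown next, no choice of w, b realizes XOR.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # -> 0, 0, 0, 1
```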



Perceptron – XOR

[Figure: the four XOR points in the plane; the two classes (o and +) are not linearly separable.]

The XOR truth table is

x  y  class
0  0  0
0  1  1
1  0  1
1  1  0

A perceptron with weights w1, w2 and bias w0 would therefore have to satisfy

0·w1 + 0·w2 + w0 ≤ 0    (1)
0·w1 + 1·w2 + w0 > 0    (2)
1·w1 + 0·w2 + w0 > 0    (3)
1·w1 + 1·w2 + w0 ≤ 0    (4)

However, these inequalities contradict each other: adding (2) and (3) gives w1 + w2 + 2w0 > 0, while adding (1) and (4) gives w1 + w2 + 2w0 ≤ 0.


Excursion: Neural Networks as a General Computing
Paradigm
The theory of computation has a long history:
John von Neumann defined (in "First Draft of a Report on the EDVAC", 1945) the von Neumann architecture we still use today for our machines.
He also introduced a new computational model which he called cellular automata (in the 1940s, together with Stanisław Ulam).
However, already in the 1930s Alonzo Church had introduced another model of computation, the so-called recursive functions and the lambda calculus. It is still at the heart of functional languages.
The field of neural networks started with researchers like McCulloch, Wiener, and von Neumann (1950s). Neural networks can be thought of as alternative building blocks for computation.



History

Backpropagation was introduced in the early 1960s; see the historical survey in [Sch14].
However, it went largely unappreciated until the 1986 paper [RHW86] by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
In this paper, several neural networks were described for which backpropagation works far faster than earlier learning approaches.
Today, backpropagation is the standard methodology to train neural networks.



Computing Activations
Let us define w^l_{jk} as the weight for the connection from the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer:

[Figure: feed-forward network (input layer x1, ..., x4, two hidden layers, output layer y1, y2, y3); one edge is labeled with the weight w^2_{53} as an example of this notation.]



Computing Activations (cont.)
Let us define b^l_j as the bias of the j-th neuron in the l-th layer.
Let us define a^l_j as the activation (output) of the j-th neuron in the l-th layer.

[Figure: the same network (input x1, ..., x4, two hidden layers, outputs y1, y2, y3); the neurons of hidden layer 1 are labeled with their biases b^1_j and the neurons of hidden layer 2 with their activations a^2_1, ..., a^2_5.]



Computing Activations (cont.)
The activation a^l_j of the j-th neuron in the l-th layer can be computed from the activations¹ in the (l−1)-th layer:

a^l_j = σ( Σ_k w^l_{jk} a^{l−1}_k + b^l_j )    (5)

This can be compactly written as a matrix equation:

a^l = σ(w^l a^{l−1} + b^l)    (6)

by introducing a matrix w^l of weights and the bias and activation vectors b^l and a^l, respectively.
Note that we implicitly use a vectorized version of σ, i.e. σ(u) := (σ(u_1), ..., σ(u_n)).

¹ We define activation functions in the sequel – just assume any function for now.
Computing Activations (cont.)

We also define the weighted input z^l of the l-th layer as

z^l = w^l a^{l−1} + b^l    (7)

Hence
a^l = σ(z^l)    (8)
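As a minimal illustration (a sketch, not from the slides), a vectorized forward pass over all layers might look like this in Python/NumPy, assuming lists `weights` and `biases` holding w^l and b^l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases, sigma=sigmoid):
    """Compute all weighted inputs z^l and activations a^l, cf. Eqs. (7) and (8)."""
    a = x
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b          # z^l = w^l a^{l-1} + b^l
        a = sigma(z)           # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```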



Learning Strategy
Strategy:
We would like to optimize the cost function C by choosing the right weights w.
In order to do so, we compute the partial derivatives ∂C/∂w and ∂C/∂b with respect to any weight w and any bias b in the network.
Then we can iteratively correct the weights by

w ← w − η ∂C/∂w

and the biases by

b ← b − η ∂C/∂b,

where η ∈ R+ is the so-called learning rate.
Note that this is a version of (stochastic) gradient descent optimization.
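A minimal sketch (illustrative only) of one such gradient descent update, assuming the gradients have already been computed and stored layer by layer:

```python
def gradient_descent_step(weights, biases, grad_w, grad_b, eta=0.1):
    """Update all weights and biases in place: w <- w - eta * dC/dw, b <- b - eta * dC/db."""
    for l in range(len(weights)):
        weights[l] -= eta * grad_w[l]
        biases[l] -= eta * grad_b[l]
    return weights, biases
```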
Brute Force

We can perform the optimization by a brute force approach:

Compute ∇C numerically.
This can easily be done as

∂C/∂w^l_i = lim_{ε→0} ( C(w^l_i + ε) − C(w^l_i) ) / ε

However, for a million weights this becomes prohibitive: each partial derivative requires a separate evaluation of C over the whole network.

Thus, we need a better approach!
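For illustration (a sketch, not part of the slides), the brute-force idea in Python/NumPy, with a hypothetical cost function `cost(weights)` and the weights flattened into a single array:

```python
import numpy as np

def numerical_gradient(cost, weights, eps=1e-6):
    """Approximate dC/dw_i by finite differences - one full cost evaluation per weight."""
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        w_plus = weights.copy()
        w_plus.flat[i] += eps
        grad.flat[i] = (cost(w_plus) - cost(weights)) / eps
    return grad
```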



Cost Functions
Backpropagation works for any cost function C satisfying these assumptions:
1. C is differentiable.
2. The cost function can be written as an average over cost functions of individual training examples – this is important for (stochastic) gradient descent.
3. The cost function can be written as a function of the neural network outputs.

The standard cost function for regression fulfills 2 and 3 of the above:

C = (1/2) ||y − a^L||².

If the activation function σ is differentiable, C is differentiable, too, and hence fulfills 1 above.
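A small sketch (assuming the per-example quadratic cost above) of this cost and of its gradient with respect to the output activations, which will later be needed as ∇_a C in backpropagation:

```python
import numpy as np

def quadratic_cost(a_L, y):
    """C = 0.5 * ||y - a^L||^2 for a single training example."""
    return 0.5 * np.sum((a_L - y) ** 2)

def quadratic_cost_grad(a_L, y):
    """dC/da^L_j = a^L_j - y_j, i.e. nabla_a C = a^L - y."""
    return a_L - y
```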
Activation Functions

For σ, the following functions are common:

1. The step function is defined as

   σ_Θ(s) := Θ(s) := { 0 if s ≤ 0; 1 if s > 0 }

   with s := wa + b. Note that the step function is not differentiable at zero.

2. The identity or linear function:

   σ_Id(s) = s


Activation Functions (cont.)
3. The sigmoid function is defined as

   σ_sigmoid(s) := sigmoid(s) := e^s / (e^s + 1) = 1 / (1 + e^{−s})

   with s := wa + b. Note that σ_sigmoid(s) ∈ (0, 1) and that the sigmoid is differentiable. A small change in weights or bias produces a small change in the output. Towards either end of the sigmoid, however, changes in s produce only a very small change in σ(s); this is the source of the vanishing gradients problem.

4. The tanh function is defined as

   σ_tanh(s) := tanh(s) = 2 / (1 + e^{−2s}) − 1 = 2 sigmoid(2s) − 1

   with s := wa + b. Note that tanh(s) ∈ (−1, 1) and that tanh is differentiable. Although its gradient is steeper than that of the sigmoid, tanh also suffers from vanishing gradients.



Activation Functions (cont.)

5. As an alternative to the tanh function, sometimes the arctan function is used:

   σ_arctan(s) := arctan(s) := tan^{−1}(s)

   with s := wa + b.

6. Another common (non-differentiable) activation function is the so-called REctified Linear Unit (RELU):

   σ_RELU(s) := { 0 if s ≤ 0; s if s > 0 }

   with s := wa + b.



Activation Functions (cont.)

[Figure: plots of the activation functions step(x), id(x), sigmoid(x), tanh(x), atan(x), and relu(x).]
Partial Derivatives of Activation Functions
Note that we need the partial derivatives of the activation functions for the backpropagation algorithm. Fortunately, these are easy to compute:

∂σ_Θ(s)/∂s = 0  for s ≠ 0    (9)
∂σ_Id(s)/∂s = 1    (10)
∂σ_sigmoid(s)/∂s = e^{−s} / (1 + e^{−s})² = σ_sigmoid(s) (1 − σ_sigmoid(s))    (11)
∂σ_tanh(s)/∂s = 1 − σ_tanh(s)²    (12)
∂σ_arctan(s)/∂s = 1 / (1 + s²)    (13)
∂σ_RELU(s)/∂s = { 0 if s < 0; 1 if s > 0 }    (14)
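A small sketch (illustrative, mirroring Eqs. (11)–(14)) of the differentiable activations and their derivatives in Python/NumPy:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):                    # Eq. (11)
    return sigmoid(s) * (1.0 - sigmoid(s))

def tanh_prime(s):                       # Eq. (12)
    return 1.0 - np.tanh(s) ** 2

def arctan_prime(s):                     # Eq. (13)
    return 1.0 / (1.0 + s ** 2)

def relu(s):
    return np.maximum(0.0, s)

def relu_prime(s):                       # Eq. (14); the value at s = 0 is a convention
    return (s > 0).astype(float)
```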



Non-Differentiable Activation Functions

Note that some activation functions are not differentiable, at least at certain points.
The theory below, including optimization using (stochastic) gradient descent, formally requires differentiability at all points.
In practice this is usually not a problem, as the functions are differentiable except at isolated points.
Thus, one usually does not "hit" these points, and/or sub-derivatives can be used to prove theorems and to ensure convergence and error estimates.



Errors

Definition 1
We define the error of the j-th neuron in layer l with respect to the weighted input z^l_j as

δ^l_j := ∂C/∂z^l_j.



First Fundamental Equation of Backpropagation
Lemma 2 (Error in the output layer)
For the output layer, the following is true:

δ^L_j = (∂C/∂a^L_j) (∂σ(z^L_j)/∂z^L_j) = (∂C/∂a^L_j) σ'(z^L_j)    (15)

Note that
∂C/∂a^L_j measures how fast the cost function changes as a function of the j-th output – i.e. if C does not depend much on neuron j, δ^L_j will be small.
σ'(z^L_j) measures how fast the activation function changes (its steepness) at z^L_j.

For the quadratic cost function C = (1/2) ||y − a^L||² we have

∂C/∂a^L_j = (a^L_j − y_j).


First Fundamental Equation of Backpropagation (cont.)

We often write the First Fundamental Equation of Backpropagation in matrix form as follows:

δ^L = ∇_a C ⊙ σ'(z^L),

where ⊙ denotes the so-called Hadamard or Schur product of two vectors:

Definition 3
The Hadamard or Schur product of two vectors a and b is defined as the point-wise product:

(a ⊙ b)_i := a_i b_i.
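As a quick illustration (a sketch using NumPy's elementwise semantics), the Hadamard product is just elementwise multiplication:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)   # Hadamard product: [ 4. 10. 18.]
```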



First Fundamental Equation of Backpropagation: Proof
Proof.
By definition,

δ^L_j := ∂C/∂z^L_j.

Hence, applying the chain rule,

δ^L_j = (∂C/∂a^L_j) (∂a^L_j/∂z^L_j).

Now, a^L_j = σ(z^L_j), thus

δ^L_j = (∂C/∂a^L_j) σ'(z^L_j).


Second Fundamental Equation of Backpropagation

Lemma 4 (Error Recursion)
The error in the l-th layer can be computed from the error in the (l+1)-th layer as follows:

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).    (16)

Suppose δ^{l+1} is known.
Applying the matrix (w^{l+1})^T to it moves the error one layer backward through the network; the result can be interpreted as a measure of the error at the output of the l-th layer.
By taking the Hadamard product with σ'(z^l) we move the error backward through the activation function in layer l.
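In code, one step of this recursion is essentially a single line (an illustrative sketch with made-up layer sizes, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))   # w^{l+1}: 3 neurons in layer l+1, 4 in layer l
delta_next = rng.standard_normal(3)    # delta^{l+1}
z_l = rng.standard_normal(4)           # z^l

# Eq. (16): delta^l = ((w^{l+1})^T delta^{l+1}) Hadamard sigma'(z^l)
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)                   # (4,)
```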



The First Two Fundamental Equations of Backpropagation

The first two fundamental equations of backpropagation enable a recursion:

δ^L = ∇_a C ⊙ σ'(z^L)    (17)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).    (18)

First, compute δ^L.
Second, compute δ^{L−1}.
Third, compute δ^{L−2}.
...


Second Fundamental Equation of Backpropagation – Proof
Proof.
By definition, δ^l_j := ∂C/∂z^l_j. Applying the chain rule and the definition of δ^{l+1}_k, we get

δ^l_j = Σ_k (∂C/∂z^{l+1}_k) (∂z^{l+1}_k/∂z^l_j) = Σ_k (∂z^{l+1}_k/∂z^l_j) δ^{l+1}_k.

Differentiating the z variables on the other hand yields (using z^{l+1}_k = Σ_j w^{l+1}_{kj} σ(z^l_j) + b^{l+1}_k):

∂z^{l+1}_k/∂z^l_j = w^{l+1}_{kj} σ'(z^l_j).

Putting these together we get

δ^l_j = Σ_k w^{l+1}_{kj} δ^{l+1}_k σ'(z^l_j),   i.e.   δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).


Third Fundamental Equation of Backpropagation
We can also easily compute the rate of change of the cost C w.r.t. any bias in the network:

Lemma 5 (Bias)

∂C/∂b^l_j = δ^l_j   ⟺   ∇_{b^l} C = δ^l.    (19)

Proof.

∂C/∂b^l_j = (∂C/∂z^l_j) (∂z^l_j/∂b^l_j) = δ^l_j · 1 = δ^l_j.



Fourth Fundamental Equation of Backpropagation
Lemma 6 (Rate of change of the cost)
The rate of change of the cost w.r.t. any weight is

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j   ⟺   ∇_{w^l} C = vec(δ^l ⊗ a^{l−1}).    (20)

Proof.

∂C/∂w^l_{jk} = (∂C/∂z^l_j) (∂z^l_j/∂w^l_{jk}) = (∂C/∂z^l_j) a^{l−1}_k = δ^l_j a^{l−1}_k.

Since for any a = (a_1, ..., a_n), b = (b_1, ..., b_m) we have a ⊗ b = (a_i b_j), and for any n × m matrix A we have vec(A) = (a_{1,1}, ..., a_{n,1}, a_{1,2}, ..., a_{n,2}, ..., a_{1,m}, ..., a_{n,m}), the second equation is equivalent to the first.
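A short sketch (illustrative values, not from the slides) showing that the weight gradient in Eq. (20) is just the outer product of the error vector and the previous layer's activations:

```python
import numpy as np

delta_l = np.array([0.1, -0.2, 0.05])        # delta^l (3 neurons in layer l)
a_prev = np.array([1.0, 0.5, 0.0, 2.0])      # a^{l-1} (4 neurons in layer l-1)

grad_W = np.outer(delta_l, a_prev)           # dC/dw^l_{jk} = a^{l-1}_k * delta^l_j
print(grad_W.shape)                          # (3, 4), same shape as w^l
```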
Fourth Fundamental Equation of Backpropagation (cont.)

From

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j    (21)

we conclude:
If the activation a^{l−1}_k is small, so is the gradient ∂C/∂w^l_{jk}.
Thus, weights going out of low-activation neurons learn slowly.
σ becomes very flat for saturated neurons, so σ'(z^l_j) and hence δ^l_j are small.
Thus, weights feeding into saturated neurons learn slowly.


Summary
The four fundamental equations of backpropagation
For a neural network with

z^l = w^l a^{l−1} + b^l
a^l = σ(z^l)
δ^l_j = ∂C/∂z^l_j,

where C is any cost function satisfying the conditions defined above, the following relations hold:

δ^L = ∇_a C ⊙ σ'(z^L)    (22)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)    (23)
∇_{b^l} C = δ^l    (24)
∇_{w^l} C = vec(δ^l ⊗ a^{l−1}).    (25)
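To make the four equations concrete, here is a minimal sketch (in Python/NumPy, assuming the sigmoid activation and the quadratic cost from above; the layer sizes and variable names are illustrative and not from the slides) that computes all δ^l and the gradients for a single training example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (grad_w, grad_b) for one training example, following Eqs. (22)-(25)."""
    # Feedforward pass: store all weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    L = len(weights)
    grad_w, grad_b = [None] * L, [None] * L

    # Eq. (22): delta^L = nabla_a C Hadamard sigma'(z^L); quadratic cost gives nabla_a C = a^L - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                              # Eq. (24)
    grad_w[-1] = np.outer(delta, activations[-2])   # Eq. (25)

    # Eq. (23): propagate the error backward through the remaining layers.
    for l in range(L - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_b[l] = delta
        grad_w[l] = np.outer(delta, activations[l])
    return grad_w, grad_b

# Tiny usage example: 2 inputs, one hidden layer with 3 neurons, 1 output.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
grad_w, grad_b = backprop(np.array([0.5, -1.0]), np.array([1.0]), weights, biases)
```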



Backpropagation – Intuition

[Figure: a small computational graph for f_1(x) + f_2(x); each node carries its derivative next to its value (f_1'(x) | f_1(x), f_2'(x) | f_2(x)), and the backward arrow labeled "backpropagation" carries the derivative f_1'(x) + f_2'(x) of the sum node.]

Figure: Nothing but the chain rule



The Backpropagation Algorithm
Algorithm 1 Backpropagation(x)
1: Set the input activation a^1.
2: while break condition not met do
3:   for l ∈ [2, ..., L] do
4:     Feedforward pass: compute z^l = w^l a^{l−1} + b^l and a^l = σ(z^l)
5:   end for
6:   Output error: compute δ^L = ∇_a C ⊙ σ'(z^L)
7:   for l ∈ [L − 1, ..., 2] do
8:     Backpropagation pass: backpropagate the error using
9:       δ^L = ∇_a C ⊙ σ'(z^L) and
10:      δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)
11:    Update: apply gradient descent by updating the weights
12:      and biases as w^l ← w^l − η ∇_{w^l} C and b^l ← b^l − η ∇_{b^l} C.
13:  end for
14: end while
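A compact sketch of this training loop (illustrative only; it reuses the hypothetical `backprop` and `gradient_descent_step` helpers from the earlier examples and uses a fixed iteration limit as its break condition):

```python
def train(weights, biases, data, eta=0.1, max_iter=1000):
    """Plain gradient descent over single examples; break condition: iteration limit."""
    for _ in range(max_iter):
        for x, y in data:
            grad_w, grad_b = backprop(x, y, weights, biases)
            gradient_descent_step(weights, biases, grad_w, grad_b, eta)
    return weights, biases
```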
The Backpropagation Algorithm (cont.)

For the break condition one usually uses one of the following:

A fixed limit on the number of iterations
A limit on the changes of the weights
A limit on the changes of the errors
A limit based on cross-validation (expensive)



Backpropagation Efficiency

Backpropagation is much more efficient than brute force, as it requires only two passes (forward and backward) per iteration instead of one cost evaluation per weight.
Being able to quickly calculate the derivatives of the activation functions is important.

Note, however, that no method can guarantee to find the global minimum – thus we risk getting stuck in a local minimum.



References I

[Lic15] G. Licata, "Are neural networks imitations of mind?", Journal of Computer Science & Systems Biology, vol. 8, 2015.

[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0

[Roj96] R. Rojas, Neural Networks: A Systematic Introduction. Berlin, Heidelberg: Springer-Verlag, 1996.

[Sch14] J. Schmidhuber, "Deep learning in neural networks: An overview," CoRR, vol. abs/1404.7828, 2014.

