Learning From Data: 10: Neural Networks - I
History
Definition
Backpropagation
Bibliography
(Figure: single-layer network with inputs x1, ..., x4 fully connected to outputs y1, ..., y3 via weights w_ij.)
The perceptron is a neural network with just one layer (node). It can
compute the logical functions:
AND
OR
However, it cannot compute the XOR function.
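As an illustration (not from the slides), here is a minimal sketch of a perceptron with a step activation; the weights and thresholds below are one possible choice realizing AND and OR on inputs from {0, 1}:

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron with step activation: output 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

# One possible choice of weights/biases (hypothetical, for illustration):
# AND fires only if both inputs are 1; OR fires if at least one input is 1.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    print(x, "AND:", perceptron(x, np.array([1.0, 1.0]), -1.5),
             "OR:",  perceptron(x, np.array([1.0, 1.0]), -0.5))
# No single choice of w, b reproduces XOR, since XOR is not linearly separable.
```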
(Figure: the XOR data points, marked o and +, are not linearly separable.)
From this we conclude that a single layer is not sufficient and that we need networks with more than one layer:
(Figure: multi-layer network with inputs x1, ..., x4, a hidden layer with activations a_1^2, ..., a_5^2, biases b_j^1, weights w_jk, and outputs y1, ..., y3.)
z^l = w^l a^{l−1} + b^l    (7)
Henceforth
a^l = σ(z^l)    (8)
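A minimal sketch of this forward pass, equations (7) and (8), assuming numpy arrays for the weight matrices w^l and bias vectors b^l and using the sigmoid as activation (function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a^l = sigma(z^l) with z^l = w^l a^(l-1) + b^l for all layers."""
    a = x
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b          # equation (7)
        a = sigmoid(z)         # equation (8)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Example: a 4-5-3 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
biases  = [rng.standard_normal(5), rng.standard_normal(3)]
zs, activations = forward(rng.standard_normal(4), weights, biases)
print(activations[-1])  # network output a^L
```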
∂C/∂w_i^l := lim_{ε→0} (C(w_i^l + ε) − C(w_i^l)) / ε
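This difference quotient can also be evaluated numerically for a small ε, which is handy for checking gradients later; a sketch (the helper name and the example cost are mine):

```python
import numpy as np

def numerical_gradient(C, w, eps=1e-6):
    """Approximate dC/dw_i by the difference quotient (C(w + eps*e_i) - C(w)) / eps."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy()
        w_plus[i] += eps
        grad[i] = (C(w_plus) - C(w)) / eps
    return grad

# Example: C(w) = 0.5 * ||w||^2, whose exact gradient is w itself.
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: 0.5 * np.sum(v**2), w))  # approx. [1, -2, 3]
```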
σ_Id(s) = s
with s := wa + b.
(Figure: plots of common activation functions: step(x), id(x), sigmoid(x), tanh(x), atan(x), and relu(x).)
Partial Derivatives of Activation Functions
Note that we need the partial derivatives of the activation functions for the
backpropagation algorithm. Fortunately, these are easy to compute:
∂σ_Θ(s)/∂s = 0,  s ≠ 0    (9)
∂σ_Id(s)/∂s = 1    (10)
∂σ_sigmoid(s)/∂s = e^{−s} / (1 + e^{−s})² = σ_sigmoid(s)(1 − σ_sigmoid(s))    (11)
∂σ_tanh(s)/∂s = 1 − σ_tanh(s)²    (12)
∂σ_arctan(s)/∂s = 1 / (1 + s²)    (13)
∂σ_RELU(s)/∂s = 0 if s < 0, 1 if s > 0    (14)
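As a sketch (the naming is mine), equations (9)-(14) in code; each derivative can be spot-checked against the finite-difference approximation above:

```python
import numpy as np

# Activation-function derivatives, following equations (9)-(14).
sigmoid   = lambda s: 1.0 / (1.0 + np.exp(-s))
d_step    = lambda s: np.zeros_like(s)                  # (9), undefined at s = 0
d_id      = lambda s: np.ones_like(s)                   # (10)
d_sigmoid = lambda s: sigmoid(s) * (1.0 - sigmoid(s))   # (11)
d_tanh    = lambda s: 1.0 - np.tanh(s) ** 2             # (12)
d_arctan  = lambda s: 1.0 / (1.0 + s ** 2)              # (13)
d_relu    = lambda s: (s > 0).astype(float)             # (14), undefined at s = 0

s = np.linspace(-3, 3, 7)
print(d_sigmoid(s))
```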
Definition 1
We define the error at layer l caused by the activation z_j^l as
δ_j^l := ∂C/∂z_j^l.
δ_j^L = (∂C/∂a_j^L) (∂σ(z_j^L)/∂z_j^L) = (∂C/∂a_j^L) σ'(z_j^L)    (15)
Note that ∂C/∂a_j^L measures how fast the cost function changes as a function of the jth
output – i.e. if C does not depend much on a neuron j, δ_j^L will be small.
σ'(z_j^L) measures how fast the activation function changes (steepness) at z_j^L.
For the quadratic cost function C = ½ ‖y − a^L‖² we have
∂C/∂a_j^L = (a_j^L − y_j).
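A quick numerical illustration of this gradient under the quadratic cost (the example values are made up):

```python
import numpy as np

y  = np.array([1.0, 0.0, 0.0])   # target
aL = np.array([0.8, 0.2, 0.1])   # network output a^L

C = 0.5 * np.sum((y - aL) ** 2)
grad = aL - y                    # dC/da_j^L = a_j^L - y_j
print(C, grad)                   # 0.045, [-0.2, 0.2, 0.1]
```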
δ^L = ∇_a C ⊙ σ'(z^L), where ⊙ denotes the Hadamard product:
Definition 3
The Hadamard or Schur product of two vectors a and b is defined as the
point-wise product:
(a ⊙ b)_i := a_i b_i.
δ_j^L = (∂C/∂a_j^L) (∂a_j^L/∂z_j^L).
δ^L = ∇_a C ⊙ σ'(z^L)    (17)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).    (18)
First, compute δ^L.
Second, compute δ^{L−1}.
Third, compute δ^{L−2}.
...
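A sketch of this backward sweep, assuming δ^L has already been obtained from (17) and applying (18) layer by layer (names and shapes are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_sweep(delta_L, weights, zs, sigma_prime):
    """Given delta^L, compute delta^l for l = L-1, ..., 2 via equation (18).

    weights = [w^2, ..., w^L] and zs = [z^2, ..., z^L] as in the forward pass.
    """
    deltas = [delta_L]
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ deltas[0]) * sigma_prime(zs[i])  # equation (18)
        deltas.insert(0, delta)
    return deltas

# Tiny demo with made-up shapes: a 4-5-3 network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
zs = [rng.standard_normal(5), rng.standard_normal(3)]
delta_L = rng.standard_normal(3)
deltas = backward_sweep(delta_L, weights, zs, lambda z: sigmoid(z) * (1 - sigmoid(z)))
print([d.shape for d in deltas])  # [(5,), (3,)]
```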
∂z_k^{l+1}/∂z_j^l = w_{kj}^{l+1} σ'(z_j^l).
∂C/∂b_j^l = δ_j^l  ⟺  ∇_{b^l} C = δ^l.    (19)
Proof.
∂C/∂b_j^l = (∂C/∂z_j^l)(∂z_j^l/∂b_j^l) = δ_j^l · 1 = δ_j^l.
Proof.
∂C/∂w_{jk}^l = (∂C/∂z_j^l)(∂z_j^l/∂w_{jk}^l) = (∂C/∂z_j^l) a_k^{l−1} = δ_j^l a_k^{l−1}.
Since for any a = (a_1, ..., a_n), b = (b_1, ..., b_m) we have a ⊗ b = (a_i b_j),
and for any n × m matrix A = (a_{i,j}) we have
vec(A) = (a_{1,1}, ..., a_{n,1}, a_{1,2}, ..., a_{n,2}, ..., a_{1,m}, ..., a_{n,m}), the second
equation is equivalent.
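A small numpy check of this equivalence (the example values are mine): the outer product δ^l ⊗ a^{l−1}, flattened column-major as in the vec definition above, matches the component-wise gradients δ_j^l a_k^{l−1}:

```python
import numpy as np

delta  = np.array([0.1, -0.2, 0.3])       # delta^l (3 neurons in layer l)
a_prev = np.array([1.0, 2.0, 0.5, 4.0])   # a^(l-1) (4 neurons in layer l-1)

outer = np.outer(delta, a_prev)           # (delta ⊗ a^(l-1))_{jk} = delta_j * a_k
vec = outer.flatten(order="F")            # column-major vec, as in the definition above

# Check one component: dC/dw_{jk} = a_k^{l-1} * delta_j
j, k = 2, 1
print(np.isclose(outer[j, k], a_prev[k] * delta[j]))  # True
print(vec[:3])  # first column of the outer product: delta * a_prev[0]
```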
Fourth Fundamental Equation of Backpropagation (cont.)
From
∂C/∂w_{jk}^l = a_k^{l−1} δ_j^l    (21)
we conclude:
If the activation a_k^{l−1} is small, so is the gradient ∂C/∂w_{jk}^l.
Thus, weights output from low-activation neurons learn slowly.
σ becomes very flat for saturated neurons, so σ'(z_j^l) is small.
Thus, weights output from saturated neurons learn slowly.
Given a network with
z^l = w^l a^{l−1} + b^l,
a^l = σ(z^l),
δ_j^l := ∂C/∂z_j^l,
where C is any cost function satisfying the conditions defined above, the following
relations hold:
δ^L = ∇_a C ⊙ σ'(z^L)    (22)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)    (23)
∇_{b^l} C = δ^l    (24)
∇_{w^l} C = vec(δ^l ⊗ a^{l−1}).    (25)
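A compact sketch putting equations (22)-(25) together for a single training example, using the quadratic cost and the sigmoid activation; the gradient with respect to w^l is kept as a matrix, which corresponds to (25) up to the vec reshaping (all names are mine, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (grad_b, grad_w) for the quadratic cost on one example (x, y)."""
    # Forward pass: z^l = w^l a^(l-1) + b^l, a^l = sigma(z^l).
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    # (22): delta^L = grad_a C (Hadamard) sigma'(z^L), with grad_a C = a^L - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [None] * len(weights)
    grad_w = [None] * len(weights)
    grad_b[-1] = delta                                   # (24)
    grad_w[-1] = np.outer(delta, activations[-2])        # (25), as a matrix
    # (23): propagate the error backwards through the hidden layers.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_b[l] = delta                                # (24)
        grad_w[l] = np.outer(delta, activations[l])      # (25), as a matrix
    return grad_b, grad_w

# Example: a 4-5-3 network with random weights and one random training example.
rng = np.random.default_rng(1)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
biases  = [rng.standard_normal(5), rng.standard_normal(3)]
grad_b, grad_w = backprop(rng.standard_normal(4), rng.standard_normal(3), weights, biases)
print([g.shape for g in grad_w])  # [(5, 4), (3, 5)]
```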
(Figure: computational graph for f_1(x) + f_2(x); backpropagation sends the derivatives f_1'(x) and f_2'(x) backwards through the graph and combines them into f_1'(x) + f_2'(x).)
Note, however, that no method can guarantee finding the global minimum; thus we risk getting stuck in a local minimum.