Artificial Neural Networks and Deep Learning

The document provides an overview of artificial neural networks and deep learning, covering topics such as threshold logic units, multi-layer perceptrons, and various training methods including gradient descent and backpropagation. It discusses mathematical foundations like regression, polynomial regression, and logistic regression, along with their applications in neural network training. The document also explains the structure and operation of different types of neural networks and their training processes.


Artificial Neural Networks and Deep Learning
Contents

• Introduction
Motivation, Biological Background

• Threshold Logic Units
Definition, Geometric Interpretation, Limitations, Networks of TLUs, Training

• General Neural Networks
Structure, Operation, Training

• Multi-layer Perceptrons
Definition, Function Approximation, Gradient Descent, Backpropagation, Variants, Sensitivity Analysis

• Deep Learning
Many-layered Perceptrons, Rectified Linear Units, Auto-Encoders, Feature Construction, Image Analysis

• Radial Basis Function Networks
Definition, Function Approximation, Initialization, Training, Generalized Version

• Self-Organizing Maps
Definition, Learning Vector Quantization, Neighborhood of Output Neurons

• Hopfield Networks and Boltzmann Machines
Definition, Convergence, Associative Memory, Solving Optimization Problems, Probabilistic Models

• Recurrent Neural Networks
Differential Equations, Vector Networks, Backpropagation through Time
Mathematical Background: Regression

Mathematical Background: Linear Regression

Training neural networks is closely related to regression.

Given: • a dataset ((x1, y1), . . . , (xn, yn)) of n data tuples and

       • a hypothesis about the functional relationship, e.g. y = g(x) = a + bx.

Approach: Minimize the sum of squared errors, that is,


F(a, b) = Σ_{i=1}^n (g(xi) − yi)² = Σ_{i=1}^n (a + bxi − yi)².

Necessary conditions for a minimum


(a.k.a. Fermat’s theorem, after Pierre de Fermat, 1601–1665):
∂F/∂a = Σ_{i=1}^n 2(a + bxi − yi) = 0 and

∂F/∂b = Σ_{i=1}^n 2(a + bxi − yi) xi = 0.

Mathematical Background: Linear Regression

Result of necessary conditions: System of so-called normal equations, that is,

• Two linear equations for two unknowns a and b.


• System can be solved with standard methods from linear algebra.
• Solution is unique unless all x-values are identical.
• The resulting line is called a regression line.

Linear Regression: Example

x 1 2 3 4 5 6 7 8
y 1 3 2 3 4 3 5 6

y = 3/4 + (7/12) x.
(Plot: the data points and the regression line.)
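The regression line above can be reproduced by solving the normal equations directly. A minimal sketch in Python (NumPy assumed; not part of the original slides):

```python
import numpy as np

# Data of the example above.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

# Normal equations for y = a + b*x in matrix form: (X^T X)(a, b)^T = X^T y,
# where X has a column of ones (for a) and the x values (for b).
X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.solve(X.T @ X, X.T @ y)

print(a, b)  # a = 3/4 = 0.75, b = 7/12 = 0.5833...
```

Solving the 2x2 system reproduces the coefficients 3/4 and 7/12 of the regression line.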

Mathematical Background: Polynomial Regression

Generalization to polynomials

y = p(x) = a0 + a1 x + . . . + am x^m

Approach: Minimize the sum of squared errors, that is,


F(a0, a1, . . . , am) = Σ_{i=1}^n (p(xi) − yi)² = Σ_{i=1}^n (a0 + a1 xi + . . . + am xi^m − yi)².

Necessary conditions for a minimum: All partial derivatives vanish, that is,

∂F/∂a0 = 0,   ∂F/∂a1 = 0,   . . . ,   ∂F/∂am = 0.

Mathematical Background: Polynomial Regression

System of normal equations for polynomials

• m + 1 linear equations for m + 1 unknowns a0, . . . , am.


• System can be solved with standard methods from linear algebra.
• Solution is unique unless the points lie exactly on a polynomial of lower degree.
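The same normal-equation machinery works for polynomials. A small sketch (Python with NumPy; the quadratic data below is hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical data: noisy samples of a quadratic, so m = 2 suffices.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.01 * rng.standard_normal(20)

m = 2  # degree of the polynomial
# Design matrix with columns x^0, x^1, ..., x^m; the system
# (X^T X) a = X^T y consists of exactly the m + 1 normal equations.
X = np.vander(x, m + 1, increasing=True)
coeffs = np.linalg.solve(X.T @ X, X.T @ y)
print(coeffs)  # close to [1.0, -2.0, 0.5]
```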
Mathematical Background: Multilinear Regression

Generalization to more than one argument

z = f (x, y) = a + bx + cy
Approach: Minimize the sum of squared errors, that is,
F(a, b, c) = Σ_{i=1}^n (f(xi, yi) − zi)² = Σ_{i=1}^n (a + bxi + cyi − zi)².
Necessary conditions for a minimum: All partial derivatives vanish, that is,

∂F/∂a = Σ_{i=1}^n 2(a + bxi + cyi − zi) = 0,

∂F/∂b = Σ_{i=1}^n 2(a + bxi + cyi − zi) xi = 0,

∂F/∂c = Σ_{i=1}^n 2(a + bxi + cyi − zi) yi = 0.

Mathematical Background: Multilinear Regression

System of normal equations for several arguments

• 3 linear equations for 3 unknowns a, b, and c.


• System can be solved with standard methods from linear algebra.
• Solution is unique unless all data points lie on a straight line.
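A sketch of the three-unknown case (Python with NumPy; hypothetical noise-free data lying exactly on the plane z = 2 + 3x − y, so the solution is exact):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(30)
y = rng.standard_normal(30)
z = 2.0 + 3.0 * x - 1.0 * y  # data lies exactly on a plane

# Design matrix with columns 1, x, y; solving (X^T X) p = X^T z yields
# the 3 normal equations for the 3 unknowns a, b, c.
X = np.column_stack([np.ones_like(x), x, y])
a, b, c = np.linalg.solve(X.T @ X, X.T @ z)
print(a, b, c)  # 2.0, 3.0, -1.0 up to rounding
```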

Mathematical Background: Logistic Regression

Generalization to non-polynomial functions

Simple example: y = a x^b

Idea: Find a transformation to the linear/polynomial case.

Transformation for the above example: ln y = ln a + b · ln x.

Special case: logistic function

y = Y / (1 + e^(a+bx))   ⇔   (Y − y)/y = e^(a+bx).

Result: Apply the so-called logit transformation

ln((Y − y)/y) = a + bx.

Logistic Regression: Example

x 1 2 3 4 5
y 0.4 1.0 3.0 5.0 5.6
Transform the data with z = ln((Y − y)/y), Y = 6.
The transformed data points are
x 1 2 3 4 5
z 2.64 1.61 0.00 −1.61 −2.64

The resulting regression line and therefore the desired function are

z ≈ −1.3775x + 4.133   and   y ≈ 6 / (1 + e^(−1.3775x+4.133)).
Attention: Note that the error is minimized only in the transformed space!
Therefore the function in the original space may not be optimal!
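The whole example can be verified in a few lines of Python (standard library only); the logit transform turns the fit into an ordinary linear regression:

```python
import math

# Data of the example above, with saturation value Y = 6.
xs = [1, 2, 3, 4, 5]
ys = [0.4, 1.0, 3.0, 5.0, 5.6]
Y = 6.0

# Logit transformation: z = ln((Y - y) / y).
zs = [math.log((Y - y) / y) for y in ys]

# Ordinary least-squares line z = a + b*x (closed form).
n = len(xs)
mx, mz = sum(xs) / n, sum(zs) / n
b = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) \
    / sum((x - mx) ** 2 for x in xs)
a = mz - b * mx
print(a, b)  # a close to 4.133, b close to -1.3775

# Back-transformation: y(x) = Y / (1 + e^(a + b*x)).
y_fit = [Y / (1 + math.exp(a + b * x)) for x in xs]
```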

Logistic Regression: Example

(Plots: the transformed data with the regression line z ≈ −1.3775x + 4.133 on the left, and the original data with the logistic curve, Y = 6, on the right.)

The logistic regression function can be computed by a single neuron with

• network input function f_net(x) ≡ wx with w ≈ −1.3775,
• activation function f_act(net, θ) ≡ (1 + e^(−(net−θ)))^(−1) with θ ≈ 4.133 and
• output function f_out(act) ≡ 6 act.

Training Multi-layer Perceptrons

Training Multi-layer Perceptrons: Gradient Descent

• Problem of logistic regression: Works only for two-layer perceptrons.


• More general approach: gradient descent.
• Necessary condition: differentiable activation and output functions.

Illustration of the gradient of a real-valued function z = f(x, y) at a point (x0, y0):

∇z|_(x0,y0) = ( ∂z/∂x|_(x0,y0), ∂z/∂y|_(x0,y0) ).

(∇ is a differential operator called "nabla" or "del".)

Gradient Descent: Formal Approach

General idea: Approach the minimum of the error function in small steps.

Error function (sum over all training patterns):

e = Σ_{l∈L_fixed} e^(l).

Form the gradient to determine the direction of the step
(here and in the following: extended weight vector w→u = (−θu, wup1, . . . , wupn)):

∇_{w→u} e = ∂e/∂w→u = ( −∂e/∂θu, ∂e/∂wup1, . . . , ∂e/∂wupn ).

Exploit the sum over the training patterns:

∇_{w→u} e = ∂/∂w→u Σ_{l∈L_fixed} e^(l) = Σ_{l∈L_fixed} ∂e^(l)/∂w→u.
Gradient Descent: Formal Approach

The single pattern error depends on the weights only through the network input:

∂e^(l)/∂w→u = (∂e^(l)/∂net_u^(l)) · (∂net_u^(l)/∂w→u).

Since net_u^(l) = w→u · in→u^(l) (note: extended input vector in→u^(l) = (1, in_up1^(l), . . . , in_upn^(l))), we have for the second factor

∂net_u^(l)/∂w→u = in→u^(l).

For the first factor we consider the error e^(l) for the training pattern l = (ı→^(l), o→^(l)):

e^(l) = Σ_{v∈U_out} e_v^(l) = Σ_{v∈U_out} (o_v^(l) − out_v^(l))²,

that is, the sum of the errors over all output neurons.

Gradient Descent: Formal Approach

Therefore we have

∂e^(l)/∂net_u^(l) = ∂/∂net_u^(l) Σ_{v∈U_out} (o_v^(l) − out_v^(l))².

Since only the actual output out_v^(l) of an output neuron v depends on the network
input net_u^(l) of the neuron u we are considering, it is

∂e^(l)/∂net_u^(l) = −2 Σ_{v∈U_out} (o_v^(l) − out_v^(l)) · ∂out_v^(l)/∂net_u^(l) = −2 δ_u^(l),

which also introduces the abbreviation δ_u^(l) for the important sum appearing here.

Gradient Descent: Formal Approach

Distinguish two cases: • The neuron u is an output neuron.
                       • The neuron u is a hidden neuron.

In the first case we have

δ_u^(l) = (o_u^(l) − out_u^(l)) · ∂out_u^(l)/∂net_u^(l).

Therefore we have for the gradient

∇_{w→u} e_u^(l) = −2 (o_u^(l) − out_u^(l)) (∂out_u^(l)/∂net_u^(l)) in→u^(l),

and thus for the weight change

Δw→u^(l) = −(η/2) ∇_{w→u} e_u^(l) = η (o_u^(l) − out_u^(l)) (∂out_u^(l)/∂net_u^(l)) in→u^(l).
Gradient Descent: Formal Approach

The exact formulae depend on the choice of the activation and the output function, since

out_u^(l) = f_out(act_u^(l)) = f_out(f_act(net_u^(l))).

Consider the special case:

• the output function is the identity,
• the activation function is logistic, that is, f_act(x) = 1/(1 + e^(−x)).

The first assumption yields

∂out_u^(l)/∂net_u^(l) = ∂act_u^(l)/∂net_u^(l) = f′_act(net_u^(l)).
Gradient Descent: Formal Approach

For a logistic activation function we have

f′_act(x) = f_act(x) · (1 − f_act(x)),

and therefore

∂out_u^(l)/∂net_u^(l) = out_u^(l) (1 − out_u^(l)).

The resulting weight change is therefore

Δw→u^(l) = η (o_u^(l) − out_u^(l)) out_u^(l) (1 − out_u^(l)) in→u^(l),

which makes the computations very simple.
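As a small numeric illustration, one training step of this rule for a single output neuron (weights, input and learning rate below are hypothetical, not from the slides):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# One gradient descent step for a single output neuron with logistic
# activation and identity output function (the special case above).
eta = 1.0
w = [0.5, -0.3, 0.8]    # extended weight vector (-theta, w1, w2)
inp = [1.0, 0.2, 0.9]   # extended input vector (1, x1, x2)
target = 1.0            # desired output o

net = sum(wi * xi for wi, xi in zip(w, inp))
out = logistic(net)

# dw = eta * (o - out) * out * (1 - out) * in
factor = eta * (target - out) * out * (1 - out)
w_new = [wi + factor * xi for wi, xi in zip(w, inp)]

print(out, w_new)  # the updated weights reduce the squared error
```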

Gradient Descent: Formal Approach

Logistic activation function and its derivative:

f_act(net_u^(l)) = 1 / (1 + e^(−net_u^(l))),      f′_act(net_u^(l)) = f_act(net_u^(l)) · (1 − f_act(net_u^(l))).

(Plots: the logistic function on the left, its derivative on the right; the derivative has its maximum 1/4 at net = 0.)

• If a logistic activation function is used (shown on the left), the weight changes are
proportional to the factor out_u^(l) (1 − out_u^(l)) (shown on the right; see the preceding slide).

• Weight changes are largest, and thus the training speed highest, in the vicinity
of netu(l) = 0. Far away from netu(l) = 0, the gradient becomes (very) small
(“saturation regions”) and thus training (very) slow.

Error Backpropagation

Consider now: The neuron u is a hidden neuron, that is, u ∈ U_k, 0 < k < r − 1.

The output out_v^(l) of an output neuron v depends on the network input net_u^(l)
only indirectly through the successor neurons succ(u) = { s ∈ U | (u, s) ∈ C }
= {s1, . . . , sm} ⊆ U_(k+1), namely through their network inputs net_s^(l).

We apply the chain rule to obtain

δ_u^(l) = Σ_{v∈U_out} (o_v^(l) − out_v^(l)) Σ_{s∈succ(u)} (∂out_v^(l)/∂net_s^(l)) (∂net_s^(l)/∂net_u^(l)).

Exchanging the sums yields

δ_u^(l) = Σ_{s∈succ(u)} ( Σ_{v∈U_out} (o_v^(l) − out_v^(l)) ∂out_v^(l)/∂net_s^(l) ) (∂net_s^(l)/∂net_u^(l)) = Σ_{s∈succ(u)} δ_s^(l) (∂net_s^(l)/∂net_u^(l)).
Error Backpropagation

Consider the network input

net_s^(l) = w→s · in→s^(l),

where one element of in→s^(l) is the output out_u^(l) of the neuron u. Therefore it is

∂net_s^(l)/∂net_u^(l) = w_su · ∂out_u^(l)/∂net_u^(l).

The result is the recursive equation (error backpropagation):

δ_u^(l) = ( Σ_{s∈succ(u)} δ_s^(l) w_su ) · ∂out_u^(l)/∂net_u^(l).
Error Backpropagation

The resulting formula for the weight change is

Δw→u^(l) = −(η/2) ∇_{w→u} e^(l) = η δ_u^(l) in→u^(l) = η ( Σ_{s∈succ(u)} δ_s^(l) w_su ) (∂out_u^(l)/∂net_u^(l)) in→u^(l).

Consider again the special case:

• the output function is the identity,
• the activation function is logistic.

The resulting formula for the weight change is then

Δw→u^(l) = η ( Σ_{s∈succ(u)} δ_s^(l) w_su ) out_u^(l) (1 − out_u^(l)) in→u^(l).
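The recursion can be checked numerically on a toy network. The sketch below (hypothetical weights, plain Python) computes the error gradient for one weight of a 2-2-1 network via the backpropagation formulas and compares it with a finite-difference estimate:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-2-1 network, logistic activation, identity output everywhere.
# Extended weight vectors: index 0 is the bias weight -theta.
w_h = [[0.1, 0.4, -0.6],   # hidden neuron 1: (-theta, w_x1, w_x2)
       [-0.2, 0.7, 0.3]]   # hidden neuron 2
w_o = [0.05, 0.9, -0.8]    # output neuron: (-theta, w_h1, w_h2)

x = [0.3, 0.7]             # input pattern
o = 1.0                    # desired output

def forward(w_h, w_o, x):
    h = [logistic(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_h]
    y = logistic(w_o[0] + w_o[1] * h[0] + w_o[2] * h[1])
    return h, y

h, y = forward(w_h, w_o, x)

# Backpropagation (the special case above):
# output neuron:  delta_out = (o - y) * y * (1 - y)
# hidden neuron u: delta_u = (delta_out * w_ou) * h_u * (1 - h_u)
delta_out = (o - y) * y * (1 - y)
delta_h = [delta_out * w_o[1 + u] * h[u] * (1 - h[u]) for u in range(2)]

# Gradient of e = (o - y)^2 w.r.t. the first hidden neuron's weight for x1:
# de/dw = -2 * delta_u * in (here the input element is x[0]).
grad_bp = -2 * delta_h[0] * x[0]

# Numerical check by a central finite difference.
eps = 1e-6
w_h[0][1] += eps
_, y_plus = forward(w_h, w_o, x)
w_h[0][1] -= 2 * eps
_, y_minus = forward(w_h, w_o, x)
w_h[0][1] += eps
grad_num = ((o - y_plus) ** 2 - (o - y_minus) ** 2) / (2 * eps)

print(grad_bp, grad_num)  # the two values agree
```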

Error Backpropagation: Cookbook Recipe

(Figure: schematic summary of the forward propagation of the activations and the backward propagation of the error factors δ.)

Gradient Descent: Examples

Gradient descent training for the negation ¬x

A single neuron with input x, weight w and threshold θ shall compute y = ¬x:
x = 0 → y = 1,   x = 1 → y = 0.

(Plots: error for x = 0, error for x = 1, and the sum of both errors as functions of θ and w.)

Note: the error for x = 0 and x = 1 is effectively the squared logistic activation function!

Gradient Descent: Examples

epoch θ w error epoch θ w error


0 3.00 3.50 1.307 0 3.00 3.50 1.295
20 3.77 2.19 0.986 20 3.76 2.20 0.985
40 3.71 1.81 0.970 40 3.70 1.82 0.970
60 3.50 1.53 0.958 60 3.48 1.53 0.957
80 3.15 1.24 0.937 80 3.11 1.25 0.934
100 2.57 0.88 0.890 100 2.49 0.88 0.880
120 1.48 0.25 0.725 120 1.27 0.22 0.676
140 −0.06 −0.98 0.331 140 −0.21 −1.04 0.292
160 −0.80 −2.07 0.149 160 −0.86 −2.08 0.140
180 −1.19 −2.74 0.087 180 −1.21 −2.74 0.084
200 −1.44 −3.20 0.059 200 −1.45 −3.19 0.058
220 −1.62 −3.54 0.044 220 −1.63 −3.53 0.044

Online Training Batch Training

Gradient Descent: Examples

Visualization of gradient descent for the negation ¬x

(Plots: trajectories of (θ, w) during training, shown over the error surface; left: online training, right: batch training.)

• Training is obviously successful.


• Error cannot vanish completely due to the properties of the logistic function.

Gradient Descent: Examples

Example function: f(x) = (5/6) x⁴ − 7x³ + (115/6) x² − 18x + 6.

i     xi      f(xi)    f′(xi)     Δxi
0     0.200   3.112   −11.147    0.011
1     0.211   2.990   −10.811    0.011
2     0.222   2.874   −10.490    0.010
3     0.232   2.766   −10.182    0.010
4     0.243   2.664    −9.888    0.010
5     0.253   2.568    −9.606    0.010
6     0.262   2.477    −9.335    0.009
7     0.271   2.391    −9.075    0.009
8     0.281   2.309    −8.825    0.009
9     0.289   2.233    −8.585    0.009
10    0.298   2.160

(Plot: f(x) with the iteration points.)

Gradient descent with initial value 0.2 and learning rate 0.001.
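The table can be reproduced directly (plain Python):

```python
# Gradient descent on the example function; reproduces the table above.
def f(x):
    return 5/6 * x**4 - 7 * x**3 + 115/6 * x**2 - 18 * x + 6

def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

eta = 0.001
x = 0.2
for i in range(10):
    x += -eta * f_prime(x)   # step against the gradient

print(round(x, 3))  # 0.298: the descent creeps downhill very slowly
```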

Gradient Descent: Examples

Example function: f(x) = (5/6) x⁴ − 7x³ + (115/6) x² − 18x + 6.

i     xi      f(xi)    f′(xi)     Δxi
0     1.500   2.719    3.500    −0.875
1     0.625   0.655   −1.431     0.358
2     0.983   0.955    2.554    −0.639
3     0.344   1.801   −7.157     1.789
4     2.134   4.127    0.567    −0.142
5     1.992   3.989    1.380    −0.345
6     1.647   3.203    3.063    −0.766
7     0.881   0.734    1.753    −0.438
8     0.443   1.211   −4.851     1.213
9     1.656   3.231    3.029    −0.757
10    0.898   0.766

(Plot: f(x) with the iteration points; starting point marked.)

Gradient descent with initial value 1.5 and learning rate 0.25.

Gradient Descent: Examples

Example function: f(x) = (5/6) x⁴ − 7x³ + (115/6) x² − 18x + 6.

i     xi      f(xi)    f′(xi)     Δxi
0     2.600   3.816   −1.707     0.085
1     2.685   3.660   −1.947     0.097
2     2.783   3.461   −2.116     0.106
3     2.888   3.233   −2.153     0.108
4     2.996   3.008   −2.009     0.100
5     3.097   2.820   −1.688     0.084
6     3.181   2.695   −1.263     0.063
7     3.244   2.628   −0.845     0.042
8     3.286   2.599   −0.515     0.026
9     3.312   2.589   −0.293     0.015
10    3.327   2.585

(Plot: f(x) with the iteration points.)

Gradient descent with initial value 2.6 and learning rate 0.05.

Gradient Descent: Variants

Weight update rule:


w(t + 1) = w(t) + ∆w(t)

Standard backpropagation:

Δw(t) = −(η/2) ∇w e(t).

Manhattan training:

Δw(t) = −η sgn(∇w e(t)).

Fixed step width (grid); only the sign of the gradient (the direction) is evaluated.

Momentum term:

Δw(t) = −(η/2) ∇w e(t) + β Δw(t − 1).

A part of the previous change is added, which may accelerate training (β ∈ [0.5, 0.95]).
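A sketch of the momentum update, applied to the example function f(x) = (5/6)x⁴ − 7x³ + (115/6)x² − 18x + 6 from the gradient descent examples (learning rate 0.001 and β = 0.9, the values used in the slide examples):

```python
def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

eta, beta = 0.001, 0.9
x, dx = 0.2, 0.0
trace = []
for i in range(10):
    dx = -eta * f_prime(x) + beta * dx  # part of the previous change is added
    x += dx
    trace.append(round(x, 3))

print(trace[:3])  # [0.211, 0.232, 0.261]: faster progress than plain descent
```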

Gradient Descent: Variants

Self-adaptive error backpropagation: an individual learning rate η_w is maintained for every weight and adapted according to the signs of the current and the previous gradient:

η_w(t) = c⁻ · η_w(t−1)   if ∇w e(t) · ∇w e(t−1) < 0,
η_w(t) = c⁺ · η_w(t−1)   if ∇w e(t) · ∇w e(t−1) > 0,
η_w(t) = η_w(t−1)        otherwise.

Resilient error backpropagation: the step width Δw itself is adapted in the same sign-based fashion, so that only the direction of the gradient is used:

Δw(t) = c⁻ · Δw(t−1)   if ∇w e(t) · ∇w e(t−1) < 0,
Δw(t) = c⁺ · Δw(t−1)   if ∇w e(t) · ∇w e(t−1) > 0,
Δw(t) = Δw(t−1)        otherwise.

Typical values: c⁻ ∈ [0.5, 0.7] and c⁺ ∈ [1.05, 1.2].
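A minimal sketch of the sign-comparison idea on the example function from the gradient descent examples. This only illustrates the adaptation scheme (a single parameter, one global step width); the published algorithms (self-adaptive backpropagation, resilient backpropagation/Rprop) differ in details such as per-weight step widths and step-size bounds:

```python
def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

c_minus, c_plus = 0.5, 1.2
x, step, g_prev = 1.5, 0.25, 0.0

for i in range(100):
    g = f_prime(x)
    if g * g_prev > 0:      # gradient sign stable: grow the step width
        step *= c_plus
    elif g * g_prev < 0:    # sign flipped, i.e. we overshot: shrink it
        step *= c_minus
    g_prev = g
    x -= step * g

print(round(x, 3))  # settles at the local minimum near x = 0.723
```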

Gradient Descent: Variants

Quickpropagation: the error function is locally approximated by a parabola through e(t−1) and e(t), and the weight is moved to the apex of this parabola.

The weight update rule can be derived from similar triangles formed by the gradients ∇w e(t−1) and ∇w e(t):

Δw(t) = ( ∇w e(t) / (∇w e(t−1) − ∇w e(t)) ) · Δw(t − 1).

(Figure: the parabola through e(t−1) and e(t) with its apex at w(t+1), and the corresponding secant through the two gradients.)

Gradient Descent: Examples

epoch θ w error epoch θ w error


0 3.00 3.50 1.295 0 3.00 3.50 1.295
20 3.76 2.20 0.985 10 3.80 2.19 0.984
40 3.70 1.82 0.970 20 3.75 1.84 0.971
60 3.48 1.53 0.957 30 3.56 1.58 0.960
80 3.11 1.25 0.934 40 3.26 1.33 0.943
100 2.49 0.88 0.880 50 2.79 1.04 0.910
120 1.27 0.22 0.676 60 1.99 0.60 0.814
140 −0.21 −1.04 0.292 70 0.54 −0.25 0.497
160 −0.86 −2.08 0.140 80 −0.53 −1.51 0.211
180 −1.21 −2.74 0.084 90 −1.02 −2.36 0.113
200 −1.45 −3.19 0.058 100 −1.31 −2.92 0.073
220 −1.63 −3.53 0.044 110 −1.52 −3.31 0.053
120 −1.67 −3.61 0.041

without momentum term with momentum term (β = 0.9)

Gradient Descent: Examples

(Plots: trajectories of (θ, w) during training; left: without momentum term, right: with momentum term.)

• Dots show position every 20 (without momentum term)


or every 10 epochs (with momentum term).
• Learning with a momentum term (β = 0.9) is about twice as fast.

Gradient Descent: Examples

Example function: f(x) = (5/6) x⁴ − 7x³ + (115/6) x² − 18x + 6.

i     xi      f(xi)    f′(xi)     Δxi
0     0.200   3.112   −11.147    0.011
1     0.211   2.990   −10.811    0.021
2     0.232   2.771   −10.196    0.029
3     0.261   2.488    −9.368    0.035
4     0.296   2.173    −8.397    0.040
5     0.337   1.856    −7.348    0.044
6     0.380   1.559    −6.277    0.046
7     0.426   1.298    −5.228    0.046
8     0.472   1.079    −4.235    0.046
9     0.518   0.907    −3.319    0.045
10    0.562   0.777

(Plot: f(x) with the iteration points.)

Gradient descent with initial value 0.2, learning rate 0.001 and momentum term β = 0.9.

Gradient Descent: Examples

Example function: f(x) = (5/6) x⁴ − 7x³ + (115/6) x² − 18x + 6.

i     xi      f(xi)    f′(xi)     Δxi
0     1.500   2.719    3.500    −1.050
1     0.450   1.178   −4.699     0.705
2     1.155   1.476    3.396    −0.509
3     0.645   0.629   −1.110     0.083
4     0.729   0.587    0.072    −0.005
5     0.723   0.587    0.001     0.000
6     0.723   0.587    0.000     0.000
7     0.723   0.587    0.000     0.000
8     0.723   0.587    0.000     0.000
9     0.723   0.587    0.000     0.000
10    0.723   0.587

(Plot: f(x) with the iteration points.)

Gradient descent with initial value 1.5, initial learning rate 0.25,
and self-adapting learning rate (c⁺ = 1.2, c⁻ = 0.5).

Other Extensions of Error Backpropagation

Flat spot elimination:

Δw(t) = −(η/2) ∇w e(t) + ζ.

• Eliminates slow learning in the saturation regions of the logistic function (ζ ≈ 0.1).
• Counteracts the decay of the error signals over the layers.

Weight decay:

Δw(t) = −(η/2) ∇w e(t) − ξ w(t).

• Helps to improve the robustness of the training results (ξ ≤ 10⁻³).
• Can be derived from an extended error function penalizing large weights:

e* = e + (ξ/2) Σ_{u ∈ U_out ∪ U_hidden} ( θu² + Σ_{p∈pred(u)} w²_up ).

Number of Hidden Neurons

• Note that the approximation theorem only states that there exist
a number of hidden neurons, weight vectors v→ and w→i and thresholds θi,
but not how they are to be chosen for a given approximation accuracy ε.

• For a single hidden layer the following rule of thumb is popular:
number of hidden neurons = (number of inputs + number of outputs) / 2

• Better, though computationally expensive approach:


◦ Randomly split the given data into two subsets of (about) equal size,
the training data and the validation data.
◦ Train multi-layer perceptrons with different numbers of hidden neurons
on the training data and evaluate them on the validation data.
◦ Repeat the random split of the data and training/evaluation many times
and average the results over the same number of hidden neurons.
Choose the number of hidden neurons with the best average error.
◦ Train a final multi-layer perceptron on the whole data set.

Number of Hidden Neurons

Principle of the training data/validation data approach:


• Underfitting: If the number of neurons in the hidden layer is too small, the
multi-layer perceptron may not be able to capture the structure of the relationship
between inputs and outputs precisely enough due to a lack of parameters.

• Overfitting: With a larger number of hidden neurons a multi-layer perceptron


may adapt not only to the regular dependence between inputs and outputs, but
also to the accidental specifics (errors and deviations) of the training data set.

• Overfitting will usually lead to the effect that the error a multi-layer perceptron
yields on the validation data will be (possibly considerably) greater than the error
it yields on the training data.
The reason is that the validation data set is likely distorted in a different fashion
than the training data, since the errors and deviations are random.

• Minimizing the error on the validation data by properly choosing the number of
hidden neurons prevents both under- and overfitting.

Number of Hidden Neurons: Avoid Overfitting

• Objective: select the model that best fits the data,
taking the model complexity into account.
The more complex the model, the better it usually fits the data.

(Plot of the data points with two fits.
Black line: regression line (2 free parameters).
Blue curve: 7th order regression polynomial (8 free parameters).)

• The blue curve fits the data points perfectly, but it is not a good model.
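This effect is easy to demonstrate. The sketch below (Python with NumPy) uses the 8 data points of the earlier linear regression example, which are assumed here to be the points shown in the plot:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

line = np.polyfit(x, y, 1)     # 2 free parameters
poly7 = np.polyfit(x, y, 7)    # 8 free parameters, one per data point

res_line = y - np.polyval(line, x)
res_poly7 = y - np.polyval(poly7, x)

# The 7th order polynomial interpolates all 8 points (near-)exactly,
# while the line leaves visible residuals; yet the line generalizes better.
print(np.max(np.abs(res_poly7)), np.max(np.abs(res_line)))
```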

Number of Hidden Neurons: Cross Validation

• The described method of iteratively splitting the data into training and validation
data may be referred to as cross validation.

• However, this term is more often used for the following specific procedure:
◦ The given data set is split into n parts or subsets (also called folds)
of about equal size (so-called n-fold cross validation).
◦ If the output is nominal (also sometimes called symbolic or categorical), this
split is done in such a way that the relative frequency of the output values in
the subsets/folds represent as well as possible the relative frequencies of these
values in the data set as a whole. This is also called stratification (derived
from the Latin stratum: layer, level, tier).
◦ Out of these n data subsets (or folds) n pairs of training and validation data
set are formed by using one fold as a validation data set while the remaining
n − 1 folds are combined into a training data set.
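The fold construction can be sketched in a few lines (plain Python; stratification omitted for brevity):

```python
import random

# Sketch of n-fold cross validation fold construction: split the index set
# into n folds and form n (training, validation) pairs from them.
def make_folds(num_points, n, seed=0):
    idx = list(range(num_points))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n] for k in range(n)]   # n subsets of about equal size
    pairs = []
    for k in range(n):
        val = folds[k]                      # one fold for validation ...
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        pairs.append((train, val))          # ... the other n-1 for training
    return pairs

pairs = make_folds(10, 4)
for train, val in pairs:
    print(len(train), len(val))
```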

Number of Hidden Neurons: Cross Validation

• The advantage of the cross validation method is that one random split of the data
yields n different pairs of training and validation data set.

• An obvious disadvantage is that (except for n = 2) the sizes of the training and the
validation data set are considerably different, which makes the results on the validation
data statistically less reliable.

• It is therefore only recommended for sufficiently large data sets or sufficiently


small n, so that the validation data sets are of sufficient size.

• Repeating the split (either with n = 2 or greater n) has the advantage that one
obtains many more training and validation data sets, leading to more reliable
statistics (here: for the number of hidden neurons).

• The described approaches fall into the category of resampling methods.


• Other well-known statistical resampling methods are bootstrap, jackknife,
subsampling and permutation test.

Avoiding Overfitting: Alternatives

• An alternative way to prevent overfitting is the following approach:


◦ During training the performance of the multi-layer perceptron is evaluated
after each epoch (or every few epochs) on a validation data set.
◦ While the error on the training data set should always decrease with each
epoch, the error on the validation data set should, after decreasing initially as
well, increase again as soon as overfitting sets in.
◦ At this moment training is terminated and either the current state or (if avail-
able) the state of the multi-layer perceptron, for which the error on the vali-
dation data reached a minimum, is reported as the training result.

• Furthermore a stopping criterion may be derived from the shape of the error curve
on the training data over the training epochs, or the network is trained only for a
fixed, relatively small number of epochs (also known as early stopping).

• Disadvantage: these methods stop the training of a complex network early enough,
rather than adjust the complexity of the network to the “correct” level.
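The bookkeeping of the described early-stopping scheme can be sketched independently of any concrete network (plain Python; the synthetic error curves below are hypothetical and only mimic the described behaviour):

```python
def early_stopping(epochs, patience=3):
    """epochs: iterable of (state, train_error, val_error), one tuple per epoch.
    Returns the state with the minimum validation error."""
    best_state, best_val, best_epoch = None, float("inf"), -1
    for t, (state, train_err, val_err) in enumerate(epochs):
        if val_err < best_val:
            best_state, best_val, best_epoch = state, val_err, t
        elif t - best_epoch >= patience:  # no improvement for `patience` epochs
            break                         # overfitting has (probably) set in
    return best_state, best_epoch, best_val

# Synthetic curves: the training error keeps falling, while the validation
# error turns upward after epoch 4 (the onset of overfitting).
curves = [(t, 1.0 / (t + 1), 0.5 + 0.01 * (t - 4) ** 2) for t in range(20)]
state, epoch, val = early_stopping(curves)
print(epoch, val)  # best epoch is 4, with validation error 0.5
```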

Sensitivity Analysis

Problem of Multi-layer Perceptrons:
• The knowledge that is learned by a neural network
is encoded in matrices/vectors of real-valued numbers and
is therefore often difficult to understand or to extract.

• Geometric interpretation or other forms of intuitive understanding are possible


only for very simple networks, but fail for complex practical problems.

• Thus a neural network is often effectively a black box,


that computes its output, though in a mathematically precise way,
in a way that is very difficult to understand and to interpret.

Idea of Sensitivity Analysis:


• Try to find out to which inputs the output(s) react(s) most sensitively.

• This may also give hints which inputs are not needed and may be discarded.

Sensitivity Analysis

Question: How important are different inputs to the network?


Idea: Determine change of output relative to change of input.

∀u ∈ U_in :   s(u) = (1/|L_fixed|) Σ_{l∈L_fixed} Σ_{v∈U_out} ∂out_v^(l) / ∂ext_u^(l).

Formal derivation: Apply the chain rule.

∂out_v/∂ext_u = (∂out_v/∂out_u) · (∂out_u/∂ext_u) = (∂out_v/∂net_v) (∂net_v/∂out_u) (∂out_u/∂ext_u).

Simplification: Assume that the output function is the identity.

∂out_u/∂ext_u = 1.

Sensitivity Analysis

For the second factor we get the general result:

∂net_v/∂out_u = ∂/∂out_u Σ_{p∈pred(v)} w_vp out_p = Σ_{p∈pred(v)} w_vp ∂out_p/∂out_u.

This leads to the recursion formula

∂out_v/∂out_u = (∂out_v/∂net_v) (∂net_v/∂out_u) = (∂out_v/∂net_v) Σ_{p∈pred(v)} w_vp ∂out_p/∂out_u.

However, for the first hidden layer we get

∂net_v/∂out_u = w_vu,   therefore   ∂out_v/∂out_u = (∂out_v/∂net_v) w_vu.

This formula marks the start of the recursion.

Sensitivity Analysis

Consider as usual the special case with


• output function is the identity,
• activation function is logistic.

The recursion formula is in this case

∂out_v/∂out_u = out_v (1 − out_v) Σ_{p∈pred(v)} w_vp ∂out_p/∂out_u,

and the anchor of the recursion is

∂out_v/∂out_u = out_v (1 − out_v) w_vu.
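The anchor and the recursion can be checked on a toy network. The sketch below (plain Python, hypothetical weights) propagates the derivatives forward through a 2-2-1 logistic network and compares the result with a finite difference:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-2-1 logistic network (hypothetical weights); identity output function.
w_h = [[0.2, 0.5, -0.4],   # hidden neuron: (-theta, w_x1, w_x2)
       [-0.1, 0.3, 0.8]]
w_o = [0.1, 1.2, -0.7]     # output neuron: (-theta, w_h1, w_h2)

def forward(x):
    h = [logistic(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_h]
    y = logistic(w_o[0] + w_o[1] * h[0] + w_o[2] * h[1])
    return h, y

x = [0.6, -0.2]
h, y = forward(x)

# Sensitivity of the output w.r.t. input x1 via the formulas above:
# anchor:    d out_h / d x1 = h (1 - h) * w_hx1     (first hidden layer)
# recursion: d y / d x1 = y (1 - y) * sum_p w_op * d out_p / d x1
d_h = [h[u] * (1 - h[u]) * w_h[u][1] for u in range(2)]
d_y = y * (1 - y) * (w_o[1] * d_h[0] + w_o[2] * d_h[1])

# Check against a central finite difference.
eps = 1e-6
_, y_plus = forward([x[0] + eps, x[1]])
_, y_minus = forward([x[0] - eps, x[1]])
d_num = (y_plus - y_minus) / (2 * eps)
print(d_y, d_num)  # the two derivatives agree
```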

Sensitivity Analysis

Attention: Use weight decay to stabilize the training results!

Δw(t) = −(η/2) ∇w e(t) − ξ w(t).

• Without weight decay the results of a sensitivity analysis


can depend (strongly) on the (random) initialization of the network.

Example: Iris data, 1 hidden layer, 3 hidden neurons, 1000 epochs

attribute        ξ = 0 (four training runs)             ξ = 0.0001 (four training runs)
sepal length     0.0216  0.0399  0.0232  0.0515         0.0367  0.0325  0.0351  0.0395
sepal width      0.0423  0.0341  0.0460  0.0447         0.0385  0.0376  0.0421  0.0425
petal length     0.1789  0.2569  0.1974  0.2805         0.2048  0.1928  0.1838  0.1861
petal width      0.2017  0.1356  0.2198  0.1325         0.2020  0.1962  0.1750  0.1743

Demonstration Software: xmlp/wmlp

Demonstration of multi-layer perceptron training:


• Visualization of the training process
• Biimplication and Exclusive Or, two continuous functions
• http://www.borgelt.net/mlpd.html

Multi-layer Perceptron Software: mlp/mlpgui

Software for training general multi-layer perceptrons:


• Command line version written in C, fast training
• Graphical user interface in Java, easy to use
• http://www.borgelt.net/mlp.html
http://www.borgelt.net/mlpgui.html
