Artificial Neural Networks and Deep Learning
1
Contents
• Introduction
Motivation, Biological Background
• Threshold Logic Units
Definition, Geometric Interpretation, Limitations, Networks of TLUs, Training
• General Neural Networks
Structure, Operation, Training
• Multi-layer Perceptrons
Definition, Function Approximation, Gradient Descent, Backpropagation, Variants, Sensitivity Analysis
• Deep Learning
Many-layered Perceptrons, Rectified Linear Units, Auto-Encoders, Feature Construction, Image Analysis
• Radial Basis Function Networks
Definition, Function Approximation, Initialization, Training, Generalized Version
• Self-Organizing Maps
Definition, Learning Vector Quantization, Neighborhood of Output Neurons
2
Mathematical Background: Regression
96
Mathematical Background: Linear Regression
97
Mathematical Background: Linear Regression
98
Linear Regression: Example
x 1 2 3 4 5 6 7 8
y 1 3 2 3 4 3 5 6
y = 3/4 + (7/12) x.
(Plot: the data points and the regression line y = 3/4 + (7/12) x, for x from 0 to 8 and y from 0 to 6.)
99
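For illustration, a minimal numpy sketch that reproduces this fit; np.polyfit solves the same least-squares problem as the normal equations.

```python
import numpy as np

# Data points from the example above.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

# Least-squares fit of a straight line y = a + b*x.
b, a = np.polyfit(x, y, deg=1)   # polyfit returns coefficients from the highest degree down

print(f"y = {a:.4f} + {b:.4f} x")   # -> y = 0.7500 + 0.5833 x, i.e. y = 3/4 + (7/12) x
```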
Mathematical Background: Polynomial Regression
Generalization to polynomials
y = p(x) = a_0 + a_1 x + ... + a_m x^m
Necessary conditions for a minimum: All partial derivatives vanish, that is,
∂F/∂a_0 = 0,   ∂F/∂a_1 = 0,   ...,   ∂F/∂a_m = 0.
100
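For illustration, a small numpy sketch of the resulting computation: setting the partial derivatives to zero yields a linear system (the normal equations) in the coefficients a_0, ..., a_m. The data set and the degree m below are assumptions chosen only for this sketch.

```python
import numpy as np

# Illustrative data (an assumption; the slide states only the general model).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

m = 2                                      # degree of the fitted polynomial
X = np.vander(x, m + 1, increasing=True)   # columns 1, x, x^2, ..., x^m

# Vanishing partial derivatives of F lead to the normal equations
# (X^T X) a = X^T y, solved here for the coefficient vector a = (a_0, ..., a_m).
a = np.linalg.solve(X.T @ X, X.T @ y)
print("coefficients a_0 ... a_m:", a)
```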
Mathematical Background: Multilinear Regression
z = f (x, y) = a + bx + cy
Approach: Minimize the sum of squared errors, that is,
F(a, b, c) = Σ_{i=1}^n (f(x_i, y_i) − z_i)^2 = Σ_{i=1}^n (a + b x_i + c y_i − z_i)^2
Necessary conditions for a minimum: All partial derivatives vanish, that is,
∂F/∂a = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) = 0,
∂F/∂b = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) x_i = 0,
∂F/∂c = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) y_i = 0.
102
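For illustration, a numpy sketch of the same computation for z = a + bx + cy. The data values are made up for this sketch; np.linalg.lstsq solves the least-squares problem described by the three conditions above.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 3.0, 5.0, 4.0])
z = np.array([3.1, 3.9, 7.0, 10.2, 10.8])

# Design matrix with columns 1, x, y for the model z = a + b*x + c*y.
X = np.column_stack([np.ones_like(x), x, y])

# The conditions dF/da = dF/db = dF/dc = 0 are the normal equations
# (X^T X)(a, b, c)^T = X^T z; lstsq solves the same least-squares problem.
(a, b, c), *_ = np.linalg.lstsq(X, z, rcond=None)
print(f"z = {a:.3f} + {b:.3f} x + {c:.3f} y")
```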
Mathematical Background: Multilinear Regression
103
Mathematical Background: Logistic Regression
ln( (Y − y) / y ) = a + bx.
107
Logistic Regression: Example
x 1 2 3 4 5
y 0.4 1.0 3.0 5.0 5.6
Transform the data with
z = ln( (Y − y) / y ),   Y = 6.
The transformed data points are
x 1 2 3 4 5
z 2.64 1.61 0.00 −1.61 −2.64
The resulting regression line and therefore the desired function are
z ≈ −1.3775 x + 4.133   and   y ≈ 6 / (1 + e^(−1.3775 x + 4.133)).
Attention: Note that the error is minimized only in the transformed space!
Therefore the function in the original space may not be optimal!
108
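For illustration, a short numpy sketch that reproduces this example: the logit transform, an ordinary least-squares fit in the transformed space, and the back-transformation to the logistic function.

```python
import numpy as np

# Data from the example; Y = 6 is the assumed saturation value.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0.4, 1.0, 3.0, 5.0, 5.6])
Y = 6.0

# Logit transform z = ln((Y - y) / y) turns the logistic model into a straight line.
z = np.log((Y - y) / y)

# Ordinary least squares in the transformed space.
b, a = np.polyfit(x, z, deg=1)
print(f"z = {b:.4f} x + {a:.4f}")        # approx. -1.3775 x + 4.133

# Back-transformation: y = Y / (1 + exp(a + b*x)).
y_hat = Y / (1.0 + np.exp(a + b * x))
print(y_hat)   # note: the error was minimized only in the transformed space
```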
Logistic Regression: Example
(Plots: the regression line for the transformed values z over x (left) and the resulting logistic function y over x with saturation value Y = 6 (right).)
109
Training Multi-layer Perceptrons
113
Training Multi-layer Perceptrons: Gradient Descent
114
Gradient Descent: Formal Approach
General Idea: Approach the minimum of the error function in small steps.
Error function: e = Σ_{l∈L_fixed} e^(l), the sum of the individual errors e^(l) over all training patterns l.
115
Gradient Descent: Formal Approach
Single pattern error depends on weights only through the network input:
For the first factor we consider the error e^(l) for the training pattern l = (i^(l), o^(l)):
e^(l) = Σ_{v∈U_out} e_v^(l),
that is, the sum of the errors over all output neurons.
116
Gradient Descent: Formal Approach
Therefore we have
117
Gradient Descent: Formal Approach
118
Gradient Descent: Formal Approach
Exact formulae depend on the choice of the activation and the output function,
since it is
out_u^(l) = f_out( act_u^(l) ) = f_out( f_act( net_u^(l) ) ).
119
Gradient Descent: Formal Approach
and therefore
120
Gradient Descent: Formal Approach
(Plots: the logistic activation function over net (left) and the resulting factor out_u^(l) (1 − out_u^(l)) with maximum 1/4 at net = 0 (right), both for net from −4 to +4.)
• If a logistic activation function is used (shown on the left), the weight changes are
proportional to λ_u^(l) = out_u^(l) (1 − out_u^(l)) (shown on the right; see the preceding slide).
• Weight changes are largest, and thus the training speed highest, in the vicinity
of netu(l) = 0. Far away from netu(l) = 0, the gradient becomes (very) small
(“saturation regions”) and thus training (very) slow.
121
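For illustration, a small Python sketch that evaluates the logistic function and the factor out(1 − out) at a few net inputs, making the saturation effect described above visible.

```python
import numpy as np

def logistic(net):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-net))

net = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
out = logistic(net)
factor = out * (1.0 - out)   # derivative of the logistic function

for n, d in zip(net, factor):
    print(f"net = {n:+.1f}   out*(1-out) = {d:.4f}")
# The factor peaks at 1/4 for net = 0 and almost vanishes for |net| >= 4,
# which makes training very slow in the saturation regions.
```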
Error Backpropagation
Consider now: The neuron u is a hidden neuron, that is, u ∈ U_k, 0 < k < r − 1.
The output out_v^(l) of an output neuron v depends on the network input net_u^(l)
only indirectly through the successor neurons succ(u) = { s ∈ U | (u, s) ∈ C }
= { s_1, ..., s_m } ⊆ U_{k+1}, namely through their network inputs net_s^(l).
123
Error Backpropagation
124
Error Backpropagation
125
Error Backpropagation: Cookbook Recipe
126
Gradient Descent: Examples
(Figure: a single neuron with input x, weight w, threshold θ and output y; it is to be trained to compute the negation of its input: x = 0 ↦ y = 1 and x = 1 ↦ y = 0.)
Note: error for x = 0 and x = 1 is effectively the squared logistic activation function!
127
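For illustration, a sketch of online gradient descent for this single logistic neuron; the initial values of w and θ and the learning rate η are arbitrary assumptions made only for this sketch.

```python
import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

# Training data: the neuron should compute the negation of its input.
X = np.array([0.0, 1.0])
O = np.array([1.0, 0.0])

w, theta = 0.5, 0.5   # arbitrary initial parameters (an assumption)
eta = 1.0             # learning rate (an assumption)

for epoch in range(1000):
    for x, o in zip(X, O):                       # online training: update after each pattern
        out = logistic(w * x - theta)
        delta = (o - out) * out * (1.0 - out)    # error signal with the logistic derivative
        w += eta * delta * x                     # corresponds to dw = -(eta/2) de/dw for e = (o - out)^2
        theta -= eta * delta                     # threshold moves in the opposite direction

print(f"w = {w:.3f}, theta = {theta:.3f}")
print("outputs:", logistic(w * X - theta))       # approaches (1, 0) as training proceeds
```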
Gradient Descent: Examples
128
Gradient Descent: Examples
(Plots: the error function over the parameter plane (θ, w), with θ and w from −4 to 4, and the descent trajectories for online training and for batch training.)
129
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   0.200    3.112   −11.147    0.011
 1   0.211    2.990   −10.811    0.011
 2   0.222    2.874   −10.490    0.010
 3   0.232    2.766   −10.182    0.010
 4   0.243    2.664    −9.888    0.010
 5   0.253    2.568    −9.606    0.010
 6   0.262    2.477    −9.335    0.009
 7   0.271    2.391    −9.075    0.009
 8   0.281    2.309    −8.825    0.009
 9   0.289    2.233    −8.585    0.009
10   0.298    2.160

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 0.2 and learning rate 0.001.
130
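For illustration, a minimal Python sketch of the iteration behind this table, Δx_i = −η f'(x_i) with x_0 = 0.2 and η = 0.001; the derivative f' is obtained analytically from the example function.

```python
def f(x):
    return 5/6 * x**4 - 7 * x**3 + 115/6 * x**2 - 18 * x + 6

def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

x, eta = 0.2, 0.001                  # initial value and learning rate from the slide
for i in range(11):
    dx = -eta * f_prime(x)           # gradient descent step
    print(f"{i:2d}  x={x:.3f}  f(x)={f(x):.3f}  f'(x)={f_prime(x):.3f}  dx={dx:.3f}")
    x += dx
```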
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   1.500    2.719     3.500    −0.875
 1   0.625    0.655    −1.431     0.358
 2   0.983    0.955     2.554    −0.639
 3   0.344    1.801    −7.157     1.789
 4   2.134    4.127     0.567    −0.142
 5   1.992    3.989     1.380    −0.345
 6   1.647    3.203     3.063    −0.766
 7   0.881    0.734     1.753    −0.438
 8   0.443    1.211    −4.851     1.213
 9   1.656    3.231     3.029    −0.757
10   0.898    0.766

(Plot: f(x) for x from 0 to 4 with the visited positions; the starting point is marked.)
Gradient descent with initial value 1.5 and learning rate 0.25.
131
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   2.600    3.816    −1.707    0.085
 1   2.685    3.660    −1.947    0.097
 2   2.783    3.461    −2.116    0.106
 3   2.888    3.233    −2.153    0.108
 4   2.996    3.008    −2.009    0.100
 5   3.097    2.820    −1.688    0.084
 6   3.181    2.695    −1.263    0.063
 7   3.244    2.628    −0.845    0.042
 8   3.286    2.599    −0.515    0.026
 9   3.312    2.589    −0.293    0.015
10   3.327    2.585

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 2.6 and learning rate 0.05.
132
Gradient Descent: Variants
Standard backpropagation:
    Δw(t) = −(η/2) ∇_w e(t).

Manhattan training:
    Δw(t) = −η sgn(∇_w e(t)).

Momentum term:
    Δw(t) = −(η/2) ∇_w e(t) + β Δw(t−1),
that is, part of the previous change is added, which may lead to accelerated training (β ∈ [0.5, 0.95]).
133
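For illustration, a small sketch of the three update rules for a generic parameter vector; η and β are illustrative values (β within the range given above).

```python
import numpy as np

def standard_backprop(grad, eta=0.2):
    # Delta w(t) = -(eta/2) * gradient
    return -eta / 2 * grad

def manhattan_training(grad, eta=0.2):
    # Only the sign of the gradient is used, giving fixed-size steps.
    return -eta * np.sign(grad)

def momentum_term(grad, prev_dw, eta=0.2, beta=0.9):
    # Part of the previous weight change is added to the current one.
    return -eta / 2 * grad + beta * prev_dw

grad = np.array([0.8, -0.1, 0.0])      # an example gradient
prev = np.array([0.05, 0.05, 0.05])    # previous weight change
print(standard_backprop(grad))
print(manhattan_training(grad))
print(momentum_term(grad, prev))
```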
Gradient Descent: Variants
134
Gradient Descent: Variants
Quick propagation
(Figure: the error e over the weight w, showing the error values e(t−1) and e(t), the gradients ∇_w e(t−1) and ∇_w e(t), and the apex m of an approximating parabola, which determines the next weight w(t+1).)
135
Gradient Descent: Examples
136
Gradient Descent: Examples
(Plots: the error function over the parameter plane (θ, w), with θ and w from −4 to 4, and the descent trajectories without momentum term and with momentum term.)
137
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   0.200    3.112   −11.147    0.011
 1   0.211    2.990   −10.811    0.021
 2   0.232    2.771   −10.196    0.029
 3   0.261    2.488    −9.368    0.035
 4   0.296    2.173    −8.397    0.040
 5   0.337    1.856    −7.348    0.044
 6   0.380    1.559    −6.277    0.046
 7   0.426    1.298    −5.228    0.046
 8   0.472    1.079    −4.235    0.046
 9   0.518    0.907    −3.319    0.045
10   0.562    0.777

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 0.2, learning rate 0.001, and momentum term (β = 0.9).
138
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   1.500    2.719     3.500    −1.050
 1   0.450    1.178    −4.699     0.705
 2   1.155    1.476     3.396    −0.509
 3   0.645    0.629    −1.110     0.083
 4   0.729    0.587     0.072    −0.005
 5   0.723    0.587     0.001     0.000
 6   0.723    0.587     0.000     0.000
 7   0.723    0.587     0.000     0.000
 8   0.723    0.587     0.000     0.000
 9   0.723    0.587     0.000     0.000
10   0.723    0.587

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 1.5, initial learning rate 0.25,
and self-adapting learning rate (c+ = 1.2, c− = 0.5).
139
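For illustration, a simplified sketch of a self-adapting learning rate, in which η is multiplied by c+ while the gradient keeps its sign and by c− when the sign flips. This is only one common variant; the exact rule used to produce the table above may differ in detail.

```python
def f_prime(x):
    # Derivative of the example function f(x) = (5/6)x^4 - 7x^3 + (115/6)x^2 - 18x + 6.
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

x, eta = 1.5, 0.25                 # initial value and initial learning rate
c_plus, c_minus = 1.2, 0.5
prev_grad = 0.0

for i in range(11):
    grad = f_prime(x)
    if grad * prev_grad > 0:       # same direction as before: increase the learning rate
        eta *= c_plus
    elif grad * prev_grad < 0:     # direction changed: decrease the learning rate
        eta *= c_minus
    x -= eta * grad
    prev_grad = grad
    print(f"{i:2d}  x={x:.3f}  eta={eta:.4f}")
```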
Other Extensions of Error Backpropagation
Flat Spot Elimination:
    Δw(t) = −(η/2) ∇_w e(t) + ζ
• Eliminates slow learning in saturation region of logistic function (ζ ≈ 0.1).
• Counteracts the decay of the error signals over the layers.
Weight Decay:
    Δw(t) = −(η/2) ∇_w e(t) − ξ w(t).
140
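For illustration, a small sketch of the two modified weight updates as written above; η, ζ and ξ are illustrative values chosen only for this sketch.

```python
import numpy as np

def flat_spot_update(grad, eta=0.2, zeta=0.1):
    # A small constant zeta keeps the weights moving even where the gradient
    # (and thus the standard update) almost vanishes in the saturation regions.
    return -eta / 2 * grad + zeta

def weight_decay_update(grad, w, eta=0.2, xi=0.001):
    # The term -xi * w additionally pulls the weights towards zero.
    return -eta / 2 * grad - xi * w

w = np.array([1.5, -0.8])
grad = np.array([0.01, -0.02])     # nearly vanishing gradient (saturation region)
print(flat_spot_update(grad))
print(weight_decay_update(grad, w))
```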
Number of Hidden Neurons
• Note that the approximation theorem only states that there exists
a number of hidden neurons and weight vectors v and w_i and thresholds θ_i,
but not how they are to be chosen for a given ε of approximation accuracy.
• For a single hidden layer the following rule of thumb is popular:
number of hidden neurons = (number of inputs + number of outputs) / 2
141
Number of Hidden Neurons
• Overfitting will usually lead to the effect that the error a multi-layer perceptron
yields on the validation data will be (possibly considerably) greater than the error
it yields on the training data.
The reason is that the validation data set is likely distorted in a different fashion
than the training data, since the errors and deviations are random.
• Minimizing the error on the validation data by properly choosing the number of
hidden neurons prevents both under- and overfitting.
142
Number of Hidden Neurons: Avoid Overfitting
(Plot: the data points with the regression line (black, 2 free parameters) and a 7th order regression polynomial (blue, 8 free parameters), for x from 0 to 8.)
• The blue curve fits the data points perfectly, but it is not a good model.
143
Number of Hidden Neurons: Cross Validation
• The described method of iteratively splitting the data into training and validation
data may be referred to as cross validation.
• However, this term is more often used for the following specific procedure:
◦ The given data set is split into n parts or subsets (also called folds)
of about equal size (so-called n-fold cross validation).
◦ If the output is nominal (also sometimes called symbolic or categorical), this
split is done in such a way that the relative frequencies of the output values in
the subsets/folds represent as well as possible the relative frequencies of these
values in the data set as a whole. This is also called stratification (derived
from the Latin stratum: layer, level, tier).
◦ Out of these n data subsets (or folds) n pairs of training and validation data
sets are formed by using one fold as the validation data set while the remaining
n − 1 folds are combined into the training data set (a short sketch follows below).
144
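For illustration, a short sketch of stratified n-fold cross validation as described above, assuming scikit-learn is available; the toy data set is made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 patterns with a nominal output (class label).
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# 3-fold cross validation with stratification: each fold preserves the
# relative class frequencies as well as possible.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # The remaining n-1 folds form the training set, the held-out fold the validation set.
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```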
Number of Hidden Neurons: Cross Validation
• The advantage of the cross validation method is that one random split of the data
yields n different pairs of training and validation data sets.
• An obvious disadvantage is that (except for n = 2) the sizes of the training and the
validation data sets differ considerably, which makes the results on the validation
data statistically less reliable.
• Repeating the split (either with n = 2 or greater n) has the advantage that one
obtains many more training and validation data sets, leading to more reliable
statistics (here: for the number of hidden neurons).
145
Avoiding Overfitting: Alternatives
• Furthermore a stopping criterion may be derived from the shape of the error curve
on the training data over the training epochs, or the network is trained only for a
fixed, relatively small number of epochs (also known as early stopping).
• Disadvantage: these methods merely stop the training of a complex network early,
rather than adjusting the complexity of the network to the “correct” level.
146
Sensitivity Analysis
147
Sensitivity Analysis
Problem of Multi-Layer Perceptrons:
• The knowledge that is learned by a neural network
is encoded in matrices/vectors of real-valued numbers and
is therefore often difficult to understand or to extract.
Idea of Sensitivity Analysis: determine the influence of the individual inputs on the output of the network.
• This may also give hints which inputs are not needed and may be discarded.
148
Sensitivity Analysis
∀u ∈ U_in:   s(u) = (1 / |L_fixed|) Σ_{l∈L_fixed} Σ_{v∈U_out} ∂out_v^(l) / ∂ext_u^(l).
149
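For illustration, a numeric sketch of this sensitivity measure for a tiny multi-layer perceptron with made-up weights; the partial derivatives ∂out_v/∂ext_u are estimated by central differences and averaged over the patterns.

```python
import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlp_output(x, W1, W2):
    """Tiny 2-4-1 multi-layer perceptron; the bias is handled via an extra input 1."""
    hidden = logistic(W1 @ np.append(x, 1.0))
    return logistic(W2 @ np.append(hidden, 1.0))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))            # hidden layer weights (incl. bias column)
W2 = rng.normal(size=(1, 5))            # output layer weights (incl. bias column)
patterns = rng.normal(size=(10, 2))     # external inputs of the fixed learning task

# Estimate s(u): average over all patterns of the derivative of the outputs
# with respect to external input u, here via central differences.
eps = 1e-5
s = np.zeros(2)
for x in patterns:
    for u in range(2):
        xp, xm = x.copy(), x.copy()
        xp[u] += eps
        xm[u] -= eps
        s[u] += np.sum(mlp_output(xp, W1, W2) - mlp_output(xm, W1, W2)) / (2 * eps)
s /= len(patterns)
print("sensitivity per input:", s)
```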
Sensitivity Analysis
∂net_v / ∂out_u = ∂/∂out_u ( Σ_{p∈pred(v)} w_vp out_p ) = Σ_{p∈pred(v)} w_vp ∂out_p / ∂out_u.
150
Sensitivity Analysis
151
Sensitivity Analysis

attribute      |             ξ = 0              |           ξ = 0.0001
sepal length   | 0.0216  0.0399  0.0232  0.0515 | 0.0367  0.0325  0.0351  0.0395
sepal width    | 0.0423  0.0341  0.0460  0.0447 | 0.0385  0.0376  0.0421  0.0425
petal length   | 0.1789  0.2569  0.1974  0.2805 | 0.2048  0.1928  0.1838  0.1861
petal width    | 0.2017  0.1356  0.2198  0.1325 | 0.2020  0.1962  0.1750  0.1743
152
Demonstration Software: xmlp/wmlp
166
Multi-Layer Perceptron Software: mlp/mlpgui
167