
Appendix G

Gradient descent and back-propagation

The back-propagation procedure for training multi-layer neural networks was initially developed by Paul Werbos and makes its first appearance in his doctoral thesis in 1974 [264]. Unfortunately, it languished there until the mid-eighties when it was discovered independently by Rumelhart et al [218]. This is possibly due to the period of dormancy that neural network research underwent following the publication of Minsky and Papert's book [186] on the limitations of perceptron networks.
Before deriving the algorithm, it will prove beneficial to consider a number
of simpler optimization problems as warm-up exercises; the back-propagation
scheme will eventually appear as a (hopefully) natural generalization.

G.1 Minimization of a function of one variable


For the sake of simplicity, a function with a single minimum is assumed. The
effect of relaxing this restriction will be discussed in a little while.
Consider the problem of minimizing the function f(x) shown in figure G.1.
If an analytical form for the function is known, elementary calculus provides the
means of solution. In general, such an expression may not be available. However,
if some means of determining the function and its first derivative at a point x is
known, the solution can be obtained by the iterative scheme described below.
Suppose the iterative scheme begins with guessing or estimating a trial position x_0 for the minimum at x_m. The next estimate x_1 is obtained by adding a small amount δx to x_0. Clearly, in order to move nearer the minimum, δx should be positive if x_0 < x_m, and negative otherwise. It appears that the answer is needed before the next step can be carried out. However, note that

$$\frac{df}{dx} < 0, \quad \text{if } x_0 < x_m \qquad \text{(G.1)}$$

$$\frac{df}{dx} > 0, \quad \text{if } x_0 > x_m. \qquad \text{(G.2)}$$
Figure G.1. A simple function of one variable (the quadratic f(x) = x²).

So, in the vicinity of the minimum, the update rule,


$$x_1 - x_0 = \delta x = +\eta, \quad \text{if } \frac{df}{dx} < 0 \qquad \text{(G.3)}$$

$$x_1 - x_0 = \delta x = -\eta, \quad \text{if } \frac{df}{dx} > 0 \qquad \text{(G.4)}$$

with η a small positive constant, moves the iteration closer to the minimum. In a simple problem of this sort, η would just be called the step-size; it is essentially the learning coefficient in the terminology of neural networks. Clearly, η should be small in order to avoid overshooting the minimum. In a more compact notation,

$$\delta x = -\eta\,\mathrm{sgn}\!\left(\frac{df}{dx}\right). \qquad \text{(G.5)}$$
Note that |df/dx| actually increases with distance from the minimum x_m. This means that the update rule

$$\delta x = -\eta \frac{df}{dx} \qquad \text{(G.6)}$$

also encodes the fact that large steps are desirable when the iterate is far from the minimum. In an ideal world, iteration of this update rule would lead to convergence to the desired minimum. Unfortunately, a number of problems can occur; the two most serious are now discussed.
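Before turning to these problems, a minimal Python sketch of the rule (G.6) may be helpful; the quadratic test function, starting point and step-size are arbitrary illustrative choices, not values from the text.

```python
def minimise_1d(df_dx, x0, eta=0.1, n_steps=100):
    """Iterate the gradient descent rule (G.6): delta_x = -eta * df/dx."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * df_dx(x)
    return x

# Example: f(x) = (x - 2)**2, so df/dx = 2*(x - 2); the minimum is at x_m = 2.
print(minimise_1d(lambda x: 2.0 * (x - 2.0), x0=-4.0))  # approaches 2.0
```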

G.1.1 Oscillation
Suppose that the function is f(x) = (x − x_m)². (This is not an unreasonable assumption as Taylor's theorem shows that most functions are approximated by a quadratic in the neighbourhood of a minimum.)

As mentioned earlier, if η is too large the iterate x_{i+1} may be on the opposite side of the minimum to x_i (figure G.2). A particularly ill-chosen value of η, η_c say, leads to x_{i+1} and x_i being equidistant from x_m. In this case, the iterate will oscillate about the minimum ad infinitum as a result of the symmetry of the function. It could be argued that choosing η = η_c would be extremely unlucky; however, any values of η slightly smaller than η_c will cause damped oscillations of the iterate about the point x_m. Such oscillations delay convergence, possibly substantially.

Figure G.2. The problem of oscillation (successive iterates x_0 and x_1 on opposite sides of the minimum of f(x) = x²).
Fortunately, there is a solution to this problem. Note that the updates δx_i and δx_{i−1} will have opposite signs and similar magnitudes at the onset of oscillation. This means that they will cancel to a large extent, and updating at step i with δx_i + δx_{i−1} would provide more stable iteration. If the iteration is not close to oscillation, the addition of the last-but-one update produces no qualitative difference. This circumstance leads to a modified update rule

$$\delta x_i = -\eta \frac{df(x_i)}{dx} + \alpha\,\delta x_{i-1}. \qquad \text{(G.7)}$$
The new coefficient α is termed the momentum coefficient; a sensible choice of this coefficient can lead to much better convergence properties for the iteration.
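A minimal sketch of the rule (G.7), extending the earlier snippet; again the test function and the values of η and α are purely illustrative.

```python
def minimise_1d_momentum(df_dx, x0, eta=0.1, alpha=0.5, n_steps=100):
    """Gradient descent with momentum, rule (G.7):
    delta_x_i = -eta * df/dx(x_i) + alpha * delta_x_{i-1}."""
    x, delta_prev = x0, 0.0
    for _ in range(n_steps):
        delta = -eta * df_dx(x) + alpha * delta_prev
        x, delta_prev = x + delta, delta
    return x

print(minimise_1d_momentum(lambda x: 2.0 * (x - 2.0), x0=-4.0))  # -> close to 2.0
```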
Unfortunately, the next problem with the procedure is not dealt with so easily.

G.1.2 Local minima


Consider the function shown in figure G.3; this illustrates a feature—a local minimum—which can cause serious problems for the iterative minimization scheme. Although x_m is the global minimum of the function, it is clear that starting the iteration at any x_0 to the right of the local minimum at x_lm will very likely lead to convergence to x_lm. There is no simple solution to this problem.

G.2 Minimizing a function of several variables


For this section it is sufficient to consider functions of two variables, i.e. f(x, y); no new features appear on generalizing to higher dimensions.
Figure G.3. The problem of local minima (a plot of f(x) = x⁴ + 2x³ − 20x² + 20).

Figure G.4. Minimizing a function over the plane (the surface f(x, y) = x² + y²).

Consider the function in figure G.4. The position of the minimum is now specified by a point in the (x, y)-plane. Any iterative procedure will require the update of both x and y. An analogue of equation (G.6) is required. The simplest generalization would be to update x and y separately using partial derivatives, e.g.,

$$\delta x = -\eta \frac{\partial f}{\partial x} \qquad \text{(G.8)}$$

which would cause a decrease in the function by moving the iterate along a line of constant y, and

$$\delta y = -\eta \frac{\partial f}{\partial y} \qquad \text{(G.9)}$$

which would achieve the same with movement along a line of constant x. In fact, this update rule proves to be an excellent choice. In vector notation, which shall be used for the remainder of this section, the coordinates are given by {x} = (x_1, x_2) and the update rule is

$$\{\delta x\} = (\delta x_1, \delta x_2) = -\eta\left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = -\eta\{\nabla\}f \qquad \text{(G.10)}$$

where {∇} is the gradient operator

$$\{\nabla\}f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right). \qquad \text{(G.11)}$$
With the choices (G.8) and (G.9) for the update rules, this approach to
optimization is often referred to as the method of gradient descent.
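As a brief illustration, rules (G.8) and (G.9) can be coded directly; the test function is the quadratic of figure G.4, and the step-size and starting point are arbitrary choices.

```python
def grad_descent_2d(df_dx, df_dy, x, y, eta=0.1, n_steps=200):
    """Componentwise gradient descent, rules (G.8) and (G.9)."""
    for _ in range(n_steps):
        x, y = x - eta * df_dx(x, y), y - eta * df_dy(x, y)
    return x, y

# f(x, y) = x**2 + y**2 has its minimum at (0, 0).
print(grad_descent_2d(lambda x, y: 2 * x, lambda x, y: 2 * y, x=4.0, y=-3.0))
```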
A problem which did not occur previously is that of choosing the direction
for the iteration, i.e. the search direction. For a function of one variable, only two
directions are possible, one of which leads to an increase in the function. In two or
more dimensions, a continuum of search directions is available and the possibility
of optimally choosing the direction arises.
Fortunately, this problem admits a fairly straightforward solution. (The following discussion closely follows that in [66].) Suppose the current position of the iterate is {x}_0. The next step should be in the direction which produces the greatest decrease in f, given a fixed step-length. Without loss of generality, the step-length can be taken as unity; the update vector {u} = (u_1, u_2) is therefore a unit vector. The problem is to maximize δf, where

$$\delta f = u_1\frac{\partial f(\{x\}_0)}{\partial x_1} + u_2\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.12)}$$

subject to the constraint on the step-length

$$u_1^2 + u_2^2 = 1. \qquad \text{(G.13)}$$
Incorporating the length constraint into the problem via a Lagrange multiplier λ [233] leads to F(u_1, u_2, λ) as the function to be maximized, where

$$F(u_1, u_2, \lambda) = u_1\frac{\partial f(\{x\}_0)}{\partial x_1} + u_2\frac{\partial f(\{x\}_0)}{\partial x_2} + \lambda(u_1^2 + u_2^2 - 1). \qquad \text{(G.14)}$$

Zeroing the derivatives with respect to the variables leads to the equations for the optimal u_1, u_2 and λ:

$$\frac{\partial F}{\partial u_1} = 0 = \frac{\partial f(\{x\}_0)}{\partial x_1} + 2\lambda u_1 \;\Rightarrow\; u_1 = -\frac{1}{2\lambda}\frac{\partial f(\{x\}_0)}{\partial x_1} \qquad \text{(G.15)}$$

$$\frac{\partial F}{\partial u_2} = 0 = \frac{\partial f(\{x\}_0)}{\partial x_2} + 2\lambda u_2 \;\Rightarrow\; u_2 = -\frac{1}{2\lambda}\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.16)}$$

$$\frac{\partial F}{\partial \lambda} = 0 = u_1^2 + u_2^2 - 1. \qquad \text{(G.17)}$$

Substituting (G.15) and (G.16) into (G.17) gives

$$\frac{1}{4\lambda^2}\left\{\left(\frac{\partial f(\{x\}_0)}{\partial x_1}\right)^2 + \left(\frac{\partial f(\{x\}_0)}{\partial x_2}\right)^2\right\} = 1 \;\Rightarrow\; 1 - \frac{1}{4\lambda^2}|\{\nabla\}f(\{x\}_0)|^2 = 0 \qquad \text{(G.18)}$$

$$\Rightarrow\; \lambda = \pm\tfrac{1}{2}|\{\nabla\}f(\{x\}_0)|. \qquad \text{(G.19)}$$
Figure G.5. Local minimum in a function over the plane (the surface f(x, y) = x⁴ − 3x³ − 50x² + 100 + y⁴).

Substituting this result into (G.15) and (G.16) gives

$$u_1 = \pm\frac{1}{|\{\nabla\}f(\{x\}_0)|}\frac{\partial f(\{x\}_0)}{\partial x_1} \qquad \text{(G.20)}$$

$$u_2 = \pm\frac{1}{|\{\nabla\}f(\{x\}_0)|}\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.21)}$$

or

$$\{u\} = \pm\frac{\{\nabla\}f(\{x\}_0)}{|\{\nabla\}f(\{x\}_0)|}. \qquad \text{(G.22)}$$
A consideration of the second derivatives reveals that the + sign gives a vector in the direction of maximum increase of f, while the − sign gives a vector in the direction of maximum decrease. This shows that the gradient descent rule

$$\{\delta x\}_{i+1} = -\eta\{\nabla\}f(\{x\}_i) \qquad \text{(G.23)}$$

is actually the best possible. For this reason, the approach is most often referred to as the method of steepest descent.
Minimization of functions of several variables by steepest descent is subject
to all the problems associated with the simple iterative method of the previous
section. The problem of oscillation certainly occurs, but can be alleviated by the
addition of a momentum term. The modified update rule is then
$$\{\delta x\}_{i+1} = -\eta\{\nabla\}f(\{x\}_i) + \alpha\{\delta x\}_i. \qquad \text{(G.24)}$$
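A small numpy sketch of the vector rule (G.24) follows; the test function, step-size, momentum coefficient and starting point are illustrative choices only.

```python
import numpy as np

def steepest_descent(grad, x0, eta=0.02, alpha=0.5, n_steps=500):
    """Steepest descent with a momentum term, rule (G.24):
    {delta x}_{i+1} = -eta * grad f({x}_i) + alpha * {delta x}_i."""
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        delta = -eta * grad(x) + alpha * delta
        x = x + delta
    return x

# f(x, y) = x**2 + 25*y**2, a long narrow valley with its minimum at the origin.
grad_f = lambda x: np.array([2.0 * x[0], 50.0 * x[1]])
print(steepest_descent(grad_f, x0=[4.0, 2.0]))  # -> close to [0, 0]
```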
The problems presented by local minima are, if anything, more severe in
higher dimensions. An example of a troublesome function is given in figure G.5.
In addition to stalling in local minima, the iteration can be directed out to
infinity along valleys.

G.3 Training a neural network


The relevant tools have been developed and this section is concerned with deriving
a learning rule for training a multi-layer perceptron (MLP) network. The method of steepest descent is directly applicable; the function to be minimized is a measure of the network error in representing a desired input–output process.
Steepest-descent is used because there is no analytical relationship between the
network parameters and the prediction error of the network. However, at each
iteration, when an input signal is presented to the network, the error is known
because the desired outputs for a given input are assumed known. Steepest-
descent is therefore a method based on supervised learning. It will be shown later
that applying the steepest-descent algorithm results in update rules coinciding
with the back-propagation rules which were stated without proof in appendix E.
This establishes that back-propagation has a rigorous basis unlike some of the
more ad hoc learning schemes. The analysis here closely follows that of Billings
et al [37].
A short review of earlier material will be given first to re-establish the
appropriate notation. The MLP network neurons are assembled into layers and
only communicate with neurons in the adjacent layers; intra-layer connections
are forbidden (see figure E.13). Each node j in layer m is connected to each node i in the following layer m+1 by connections of weight w_{ij}^{(m+1)}. The network has l+1 layers, layer 0 being the input layer and layer l the output. Signals are passed through each node in layer m+1 as follows: a weighted sum is performed at i of all outputs x_j^{(m)} from the preceding layer; this gives the excitation z_i^{(m+1)} of the node

$$z_i^{(m+1)} = \sum_{j=0}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)} \qquad \text{(G.25)}$$
where n^{(m)} is the number of nodes in layer m. (The summation index starts from zero in order to accommodate the bias node.) The excitation signal is then passed through a nonlinear activation function f to emerge as the output x_i^{(m+1)} of the node to the next layer

$$x_i^{(m+1)} = f(z_i^{(m+1)}) = f\left(\sum_{j=0}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)}\right). \qquad \text{(G.26)}$$
Various choices for f are possible; in fact, the only restrictions on f are that it should be differentiable and monotonically increasing [219]. The hyperbolic tangent function f(x) = tanh(x) is used throughout this work, although the sigmoid f(x) = (1 + e^{−x})^{−1} is also very popular. The input layer nodes do not have nonlinear activation functions as their purpose is simply to distribute the network inputs to the nodes in the first hidden layer. The signals propagate only forward through the layers so the network is of the feedforward type.
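To make the notation concrete, here is a small Python sketch of the forward pass (G.25)–(G.26) for a network with one hidden layer and tanh activations, with the bias handled as the 0th node whose output is fixed at unity. The layer sizes, weights and input values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def layer_forward(w, x_prev):
    """One layer of the forward pass, equations (G.25)-(G.26).
    w[i, j] is the weight w_ij; x_prev[0] = 1 is the bias node output."""
    z = w @ x_prev        # excitations z_i, equation (G.25)
    return np.tanh(z), z  # outputs x_i, equation (G.26)

# Illustrative sizes: 2 inputs, 3 hidden nodes, 1 output node.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 1 + 2))   # hidden-layer weights (column 0: bias)
w2 = rng.normal(size=(1, 1 + 3))   # output-layer weights (column 0: bias)

u = np.array([0.5, -1.2])                        # network inputs
x1, z1 = layer_forward(w1, np.concatenate(([1.0], u)))
y_hat, z2 = layer_forward(w2, np.concatenate(([1.0], x1)))
```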
An exception to the rule stated earlier, forbidding connections between layers which are not adjacent, is provided by the bias node which passes signals to all other nodes except those in the input layer. The output of the bias node is held constant at unity in order to allow constant offsets in the excitations. This is an alternative to associating a threshold θ_i^{(m)} with each node so that the excitation is calculated from

$$z_i^{(m+1)} = \sum_{j=1}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)} + \theta_i^{(m+1)}. \qquad \text{(G.27)}$$

The bias node is considered to be the 0th node in each layer.
As mentioned, training of the MLP requires sets of network inputs for which the desired network outputs are known. At each training step, a set of network inputs is passed forward through the layers, yielding finally a set of trial outputs ŷ_i, i = 1, ..., n^{(l)}. These are compared with the desired outputs y_i. If the comparison errors δ_i^{(l)} = y_i − ŷ_i are considered small enough, the network weights are not adjusted. However, if a significant error is obtained, the error is passed backwards through the layers and the weights are updated as the error signal propagates back through the connections. This is the source of the name back-propagation.

For each presentation of a training set, a measure J of the network error is evaluated where

$$J(t) = \frac{1}{2}\sum_{i=1}^{n^{(l)}} (y_i(t) - \hat{y}_i(t))^2 \qquad \text{(G.28)}$$

and J is implicitly a function of the network parameters, J = J(θ_1, ..., θ_n), where the θ_i are the connection weights ordered in some way. The integer t labels the presentation order of the training sets (the index t is suppressed in most of the following theory as a single presentation is considered). After a presentation of a training set, the steepest-descent algorithm requires an adjustment of the parameters

$$\Delta\theta_i = -\eta\frac{\partial J}{\partial\theta_i} = -\eta\nabla_i J \qquad \text{(G.29)}$$

where ∇_i is the gradient operator in the parameter space. As before, the learning coefficient η determines the step-size in the direction of steepest descent. Because only the errors for the output layer are known, it is necessary to construct effective errors for each of the hidden layers by propagating back the error from the output layer. For the output (lth) layer of the network an application of the chain rule of partial differentiation [233] yields
$$\frac{\partial J}{\partial w_{ij}^{(l)}} = \frac{\partial J}{\partial \hat{y}_i}\cdot\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}}. \qquad \text{(G.30)}$$

Now

$$\frac{\partial J}{\partial \hat{y}_i} = -(y_i - \hat{y}_i) = -\delta_i^{(l)} \qquad \text{(G.31)}$$

and as

$$\hat{y}_i = f\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right) \qquad \text{(G.32)}$$

a further application of the chain rule,

$$\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}} = \frac{\partial f}{\partial z_i^{(l)}}\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} \qquad \text{(G.33)}$$

where z is defined as in (G.25), yields

$$\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}} = f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)} = f'(z_i^{(l)})\,x_j^{(l-1)}. \qquad \text{(G.34)}$$
So substituting this equation and (G.31) into (G.30) gives

$$\frac{\partial J}{\partial w_{ij}^{(l)}} = -f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)}\,\delta_i^{(l)} \qquad \text{(G.35)}$$

and the update rule for connections to the output layer is obtained from (G.29) as

$$\Delta w_{ij}^{(l)} = \eta\,f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)}\,\delta_i^{(l)} = \eta\,f'(z_i^{(l)})\,x_j^{(l-1)}\,\delta_i^{(l)} \qquad \text{(G.36)}$$

where

$$f'(z) = (1 + f(z))(1 - f(z)) \qquad \text{(G.37)}$$

if f is the hyperbolic tangent function, and

$$f'(z) = f(z)(1 - f(z)) \qquad \text{(G.38)}$$
if f is the sigmoid. Note that the whole optimization hinges critically on the fact that the transfer function f is differentiable. The existence of f' is crucial to the propagation of errors to the hidden layers and to their subsequent training. This is the reason why perceptrons could not have hidden layers and were consequently so limited. The use of discontinuous 'threshold' functions as transfer functions meant that hidden layers could not be trained.
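As an illustration of equations (G.31), (G.36) and (G.37), a small self-contained Python fragment for the output-layer update follows; the layer sizes, weight values and learning coefficient are arbitrary illustrative choices.

```python
import numpy as np

def tanh_prime(z):
    """f'(z) = (1 + f(z)) * (1 - f(z)) for f = tanh, equation (G.37)."""
    return (1.0 + np.tanh(z)) * (1.0 - np.tanh(z))

def output_layer_update(w_out, x_prev, y, eta=0.1):
    """Update the output-layer weights using (G.31) and (G.36).
    x_prev includes the bias node output (fixed at 1) as its 0th entry."""
    z = w_out @ x_prev        # excitations of the output nodes
    y_hat = np.tanh(z)        # trial outputs
    delta = y - y_hat         # output errors delta_i^(l), equation (G.31)
    # Delta w_ij^(l) = eta * f'(z_i) * x_j * delta_i, equation (G.36)
    return w_out + eta * np.outer(tanh_prime(z) * delta, x_prev)

# Illustrative example: one output node fed by a bias node plus three hidden outputs.
w_out = np.array([[0.1, -0.4, 0.2, 0.7]])
x_prev = np.array([1.0, 0.3, -0.8, 0.5])
w_out = output_layer_update(w_out, x_prev, y=np.array([0.25]))
```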
Updating of the parameters is essentially the same for the hidden layers except that an explicit error δ_i^{(m)} is not available. The errors for the hidden layer nodes must be constructed.

Considering the (l−1)th layer and applying the chain rule once more gives

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = \sum_{k=1}^{n^{(l)}}\frac{\partial J}{\partial \hat{y}_k}\frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}}\frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}}\frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}}. \qquad \text{(G.39)}$$

Now

$$\frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}} = f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)} \qquad \text{(G.40)}$$

$$\frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}} = f'(z_i^{(l-1)}) = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right) \qquad \text{(G.41)}$$

and

$$\frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}} = x_j^{(l-2)} \qquad \text{(G.42)}$$

so (G.39) becomes

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = -\sum_{k=1}^{n^{(l)}}\delta_k^{(l)}\,f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)}\,f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right)x_j^{(l-2)}. \qquad \text{(G.43)}$$

If the errors for the ith neuron of the (l−1)th layer are now defined as

$$\delta_i^{(l-1)} = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right)\sum_{k=1}^{n^{(l)}} f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)}\,\delta_k^{(l)} \qquad \text{(G.44)}$$

or

$$\delta_i^{(l-1)} = f'(z_i^{(l-1)})\sum_{k=1}^{n^{(l)}} f'(z_k^{(l)})\,w_{ki}^{(l)}\,\delta_k^{(l)} \qquad \text{(G.45)}$$

then equation (G.43) takes the simple form

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = -\delta_i^{(l-1)} x_j^{(l-2)}. \qquad \text{(G.46)}$$
On carrying out this argument for all hidden layers m ∈ {l−1, l−2, ..., 1}, the general rules

$$\delta_i^{(m-1)}(t) = f'\left(\sum_{j=0}^{n^{(m-2)}} w_{ij}^{(m-1)}(t-1)\,x_j^{(m-2)}(t)\right)\sum_{k=1}^{n^{(m)}}\delta_k^{(m)}(t)\,w_{ki}^{(m)}(t-1) \qquad \text{(G.47)}$$

or

$$\delta_i^{(m-1)}(t) = f'\left(z_i^{(m-1)}(t)\right)\sum_{k=1}^{n^{(m)}}\delta_k^{(m)}(t)\,w_{ki}^{(m)}(t-1) \qquad \text{(G.48)}$$

and

$$\frac{\partial J}{\partial w_{ij}^{(m-1)}}(t) = -\delta_i^{(m-1)}(t)\,x_j^{(m-2)}(t) \qquad \text{(G.49)}$$

are obtained (on restoring the t index which labels the presentation of the training
set). Hence the name back-propagation.
Finally, the update rule for all the connection weights of the hidden layers
can be given as
$$w_{ij}^{(m)}(t) = w_{ij}^{(m)}(t-1) + \Delta w_{ij}^{(m)}(t) \qquad \text{(G.50)}$$

where

$$\Delta w_{ij}^{(m)}(t) = \eta\,\delta_i^{(m)}(t)\,x_j^{(m-1)}(t) \qquad \text{(G.51)}$$

for each presentation of a training set.


There is little guidance in the literature as to what the learning coefficient η should be; if it is taken too small, convergence to the correct parameters may take an extremely long time. However, if η is made large, learning is much more rapid but the parameters may diverge or oscillate in the fashion described in earlier sections. One way around this problem is to introduce a momentum term into the update rule as before:

$$\Delta w_{ij}^{(m)}(t) = \eta\,\delta_i^{(m)}(t)\,x_j^{(m-1)}(t) + \alpha\,\Delta w_{ij}^{(m)}(t-1) \qquad \text{(G.52)}$$

where α is the momentum coefficient. The additional term essentially damps out high-frequency variations in the error surface.
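Putting the pieces together, the following Python sketch carries out one back-propagation training step for a single-hidden-layer MLP with tanh activations, using (G.48), (G.51) and the momentum rule (G.52). It is a minimal illustration only: the network sizes, the learning and momentum coefficients, and the choice to fold the f′ factor of (G.36) into the output-layer delta (so that the single formula (G.51) serves both layers) are assumptions made here, not prescriptions from the text.

```python
import numpy as np

def tanh_prime(z):
    """f'(z) = (1 + f(z)) * (1 - f(z)) for f = tanh, equation (G.37)."""
    return (1.0 + np.tanh(z)) * (1.0 - np.tanh(z))

def train_step(w1, w2, dw1, dw2, u, y, eta=0.05, alpha=0.7):
    """One presentation of a training pattern (u, y) to a one-hidden-layer MLP.

    w1, w2   : hidden- and output-layer weight matrices (column 0 = bias weights)
    dw1, dw2 : previous weight increments, kept for the momentum term (G.52)
    Returns the updated weights and increments.
    """
    # Forward pass, equations (G.25)-(G.26).
    x0 = np.concatenate(([1.0], u))            # bias node plus network inputs
    z1 = w1 @ x0
    x1 = np.concatenate(([1.0], np.tanh(z1)))
    z2 = w2 @ x1
    y_hat = np.tanh(z2)

    # Output-layer error; the f'(z) factor of (G.36) is folded in here so that
    # the single update formula (G.51) applies to both layers.
    delta2 = tanh_prime(z2) * (y - y_hat)
    # Hidden-layer errors constructed by back-propagation, equation (G.48)
    # (the bias column of w2 carries no error backwards).
    delta1 = tanh_prime(z1) * (w2[:, 1:].T @ delta2)

    # Weight increments with momentum, equations (G.51)-(G.52).
    dw2 = eta * np.outer(delta2, x1) + alpha * dw2
    dw1 = eta * np.outer(delta1, x0) + alpha * dw1
    return w1 + dw1, w2 + dw2, dw1, dw2

# Illustrative use: a 1-4-1 network trained repeatedly on a single pattern.
rng = np.random.default_rng(1)
w1 = rng.normal(scale=0.5, size=(4, 2))   # 4 hidden nodes, bias + 1 input
w2 = rng.normal(scale=0.5, size=(1, 5))   # 1 output node, bias + 4 hidden nodes
dw1, dw2 = np.zeros_like(w1), np.zeros_like(w2)
for _ in range(200):
    w1, w2, dw1, dw2 = train_step(w1, w2, dw1, dw2,
                                  u=np.array([0.4]), y=np.array([0.8]))
```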
As usual with steepest-descent methods, back-propagation only guarantees
convergence to a local minimum of the error function. In fact the MLP is highly
nonlinear in the parameters and the error surface will consequently have many
minima. Various methods of overcoming this problem have been proposed; none has met with total success.
