
Appendix G

Gradient descent and back-propagation

The back-propagation procedure for training multi-layer neural networks was initially developed by Paul Werbos and makes its first appearance in his doctoral thesis in 1974 [264]. Unfortunately, it languished there until the mid-eighties when it was discovered independently by Rumelhart et al [218]. This is possibly due to the period of dormancy that neural network research underwent following the publication of Minsky and Papert's book [186] on the limitations of perceptron networks.
Before deriving the algorithm, it will prove beneficial to consider a number
of simpler optimization problems as warm-up exercises; the back-propagation
scheme will eventually appear as a (hopefully) natural generalization.

G.1 Minimization of a function of one variable


For the sake of simplicity, a function with a single minimum is assumed. The
effect of relaxing this restriction will be discussed in a little while.
Consider the problem of minimizing the function f(x) shown in figure G.1.
If an analytical form for the function is known, elementary calculus provides the
means of solution. In general, such an expression may not be available. However,
if some means of determining the function and its first derivative at a point x is
known, the solution can be obtained by the iterative scheme described below.
Suppose the iterative scheme begins with guessing or estimating a trial position x_0 for the minimum at x_m. The next estimate x_1 is obtained by adding a small amount δx to x_0. Clearly, in order to move nearer the minimum, δx should be positive if x_0 < x_m, and negative otherwise. It appears that the answer is needed before the next step can be carried out. However, note that

$$\frac{df}{dx} < 0, \quad \text{if } x_0 < x_m \qquad \text{(G.1)}$$

$$\frac{df}{dx} > 0, \quad \text{if } x_0 > x_m. \qquad \text{(G.2)}$$
Figure G.1. A simple function of one variable (the quadratic f(x) = x²).

So, in the vicinity of the minimum, the update rule,


$$x_1 - x_0 = \delta x = +\eta, \quad \text{if } \frac{df}{dx} < 0 \qquad \text{(G.3)}$$

$$x_1 - x_0 = \delta x = -\eta, \quad \text{if } \frac{df}{dx} > 0 \qquad \text{(G.4)}$$

with η a small positive constant, moves the iteration closer to the minimum. In a simple problem of this sort, η would just be called the step-size; it is essentially the learning coefficient in the terminology of neural networks. Clearly, η should be small in order to avoid overshooting the minimum. In a more compact notation,

$$\delta x = -\eta\,\mathrm{sgn}\!\left(\frac{df}{dx}\right). \qquad \text{(G.5)}$$
Note that |df/dx| actually increases with distance from the minimum x_m. This means that the update rule

$$\delta x = -\eta \frac{df}{dx} \qquad \text{(G.6)}$$

also encodes the fact that large steps are desirable when the iterate is far from the minimum. In an ideal world, iteration of this update rule would lead to convergence to the desired minimum. Unfortunately, a number of problems can occur; the two most serious are now discussed.
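Before turning to these problems, a minimal Python sketch of the rule (G.6) may be helpful; the quadratic test function, starting point and step-size are arbitrary illustrative choices, not values from the text.

```python
def minimise_1d(df_dx, x0, eta=0.1, n_steps=100):
    """Iterate the gradient descent rule (G.6): delta_x = -eta * df/dx."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * df_dx(x)
    return x

# Example: f(x) = (x - 2)**2, so df/dx = 2*(x - 2); the minimum is at x_m = 2.
print(minimise_1d(lambda x: 2.0 * (x - 2.0), x0=-4.0))  # approaches 2.0
```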

G.1.1 Oscillation
Suppose that the function is f(x) = (x − x_m)². (This is not an unreasonable assumption as Taylor's theorem shows that most functions are approximated by a quadratic in the neighbourhood of a minimum.)

As mentioned earlier, if η is too large the iterate x_{i+1} may be on the opposite side of the minimum to x_i (figure G.2). A particularly ill-chosen value of η, η_c say, leads to x_{i+1} and x_i being equidistant from x_m. In this case, the iterate will oscillate about the minimum ad infinitum as a result of the symmetry of the function. It could be argued that choosing η = η_c would be extremely unlucky; however, any values of η slightly smaller than η_c will cause damped oscillations of the iterate about the point x_m. Such oscillations delay convergence, possibly substantially.

Figure G.2. The problem of oscillation (successive iterates x_0 and x_1 on opposite sides of the minimum of f(x) = x²).
Fortunately, there is a solution to this problem. Note that the updates δx_i and δx_{i−1} will have opposite signs and similar magnitudes at the onset of oscillation. This means that they will cancel to a large extent, and updating at step i with δx_i + δx_{i−1} would provide more stable iteration. If the iteration is not close to oscillation, the addition of the last-but-one update produces no qualitative difference. This circumstance leads to a modified update rule

$$\delta x_i = -\eta \frac{df(x_i)}{dx} + \alpha\,\delta x_{i-1}. \qquad \text{(G.7)}$$
The new coefficient α is termed the momentum coefficient; a sensible choice of this coefficient can lead to much better convergence properties for the iteration.
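A minimal sketch of the rule (G.7), extending the earlier snippet; again the test function and the values of η and α are purely illustrative.

```python
def minimise_1d_momentum(df_dx, x0, eta=0.1, alpha=0.5, n_steps=100):
    """Gradient descent with momentum, rule (G.7):
    delta_x_i = -eta * df/dx(x_i) + alpha * delta_x_{i-1}."""
    x, delta_prev = x0, 0.0
    for _ in range(n_steps):
        delta = -eta * df_dx(x) + alpha * delta_prev
        x, delta_prev = x + delta, delta
    return x

print(minimise_1d_momentum(lambda x: 2.0 * (x - 2.0), x0=-4.0))  # -> close to 2.0
```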
Unfortunately, the next problem with the procedure is not dealt with so easily.

G.1.2 Local minima


Consider the function shown in figure G.3; this illustrates a feature—a local minimum—which can cause serious problems for the iterative minimization scheme. Although x_m is the global minimum of the function, it is clear that starting the iteration at any x_0 to the right of the local minimum at x_lm will very likely lead to convergence to x_lm. There is no simple solution to this problem.

G.2 Minimizing a function of several variables


For this section it is sufficient to consider functions of two variables, i.e. f(x, y); no new features appear on generalizing to higher dimensions.
Figure G.3. The problem of local minima (a plot of f(x) = x⁴ + 2x³ − 20x² + 20).

Figure G.4. Minimizing a function over the plane (the surface f(x, y) = x² + y²).

Consider the function in figure G.4. The position of the minimum is now specified by a point in the (x, y)-plane. Any iterative procedure will require the update of both x and y. An analogue of equation (G.6) is required. The simplest generalization would be to update x and y separately using partial derivatives, e.g.,

$$\delta x = -\eta \frac{\partial f}{\partial x} \qquad \text{(G.8)}$$

which would cause a decrease in the function by moving the iterate along a line of constant y, and

$$\delta y = -\eta \frac{\partial f}{\partial y} \qquad \text{(G.9)}$$

which would achieve the same with movement along a line of constant x. In fact, this update rule proves to be an excellent choice. In vector notation, which shall be used for the remainder of this section, the coordinates are given by {x} = (x_1, x_2) and the update rule is

$$\{\delta x\} = (\delta x_1, \delta x_2) = -\eta\left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = -\eta\{\nabla\}f \qquad \text{(G.10)}$$

where {∇} is the gradient operator

$$\{\nabla\}f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right). \qquad \text{(G.11)}$$
With the choices (G.8) and (G.9) for the update rules, this approach to
optimization is often referred to as the method of gradient descent.
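As a brief illustration, rules (G.8) and (G.9) can be coded directly; the test function is the quadratic of figure G.4, and the step-size and starting point are arbitrary choices.

```python
def grad_descent_2d(df_dx, df_dy, x, y, eta=0.1, n_steps=200):
    """Componentwise gradient descent, rules (G.8) and (G.9)."""
    for _ in range(n_steps):
        x, y = x - eta * df_dx(x, y), y - eta * df_dy(x, y)
    return x, y

# f(x, y) = x**2 + y**2 has its minimum at (0, 0).
print(grad_descent_2d(lambda x, y: 2 * x, lambda x, y: 2 * y, x=4.0, y=-3.0))
```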
A problem which did not occur previously is that of choosing the direction
for the iteration, i.e. the search direction. For a function of one variable, only two
directions are possible, one of which leads to an increase in the function. In two or
more dimensions, a continuum of search directions is available and the possibility
of optimally choosing the direction arises.
Fortunately, this problem admits a fairly straightforward solution. (The following discussion closely follows that in [66].) Suppose the current position of the iterate is {x}_0. The next step should be in the direction which produces the greatest decrease in f, given a fixed step-length. Without loss of generality, the step-length can be taken as unity; the update vector {u} = (u_1, u_2) is therefore a unit vector. The problem is to maximize δf, where

$$\delta f = u_1\frac{\partial f(\{x\}_0)}{\partial x_1} + u_2\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.12)}$$

subject to the constraint on the step-length

$$u_1^2 + u_2^2 = 1. \qquad \text{(G.13)}$$
Incorporating the length constraint into the problem via a Lagrange multiplier λ [233] leads to F(u_1, u_2, λ) as the function to be maximized, where

$$F(u_1, u_2, \lambda) = u_1\frac{\partial f(\{x\}_0)}{\partial x_1} + u_2\frac{\partial f(\{x\}_0)}{\partial x_2} + \lambda(u_1^2 + u_2^2 - 1). \qquad \text{(G.14)}$$

Zeroing the derivatives with respect to the variables leads to the equations for the optimal u_1, u_2 and λ:

$$\frac{\partial F}{\partial u_1} = 0 = \frac{\partial f(\{x\}_0)}{\partial x_1} + 2\lambda u_1 \;\Rightarrow\; u_1 = -\frac{1}{2\lambda}\frac{\partial f(\{x\}_0)}{\partial x_1} \qquad \text{(G.15)}$$

$$\frac{\partial F}{\partial u_2} = 0 = \frac{\partial f(\{x\}_0)}{\partial x_2} + 2\lambda u_2 \;\Rightarrow\; u_2 = -\frac{1}{2\lambda}\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.16)}$$

$$\frac{\partial F}{\partial \lambda} = 0 = u_1^2 + u_2^2 - 1. \qquad \text{(G.17)}$$

Substituting (G.15) and (G.16) into (G.17) gives

$$\frac{1}{4\lambda^2}\left\{\left(\frac{\partial f(\{x\}_0)}{\partial x_1}\right)^2 + \left(\frac{\partial f(\{x\}_0)}{\partial x_2}\right)^2\right\} = 1 \;\Rightarrow\; 1 - \frac{1}{4\lambda^2}|\{\nabla\}f(\{x\}_0)|^2 = 0 \qquad \text{(G.18)}$$

$$\Rightarrow\; \lambda = \pm\tfrac{1}{2}|\{\nabla\}f(\{x\}_0)|. \qquad \text{(G.19)}$$
Figure G.5. Local minimum in a function over the plane (the surface f(x, y) = x⁴ − 3x³ − 50x² + 100 + y⁴).

Substituting this result into (G.15) and (G.16) gives

$$u_1 = \pm\frac{1}{|\{\nabla\}f(\{x\}_0)|}\frac{\partial f(\{x\}_0)}{\partial x_1} \qquad \text{(G.20)}$$

$$u_2 = \pm\frac{1}{|\{\nabla\}f(\{x\}_0)|}\frac{\partial f(\{x\}_0)}{\partial x_2} \qquad \text{(G.21)}$$

or

$$\{u\} = \pm\frac{\{\nabla\}f(\{x\}_0)}{|\{\nabla\}f(\{x\}_0)|}. \qquad \text{(G.22)}$$
A consideration of the second derivatives reveals that the + sign gives a vector in the direction of maximum increase of f, while the − sign gives a vector in the direction of maximum decrease. This shows that the gradient descent rule

$$\{\delta x\}_{i+1} = -\eta\{\nabla\}f(\{x\}_i) \qquad \text{(G.23)}$$

is actually the best possible. For this reason, the approach is most often referred to as the method of steepest descent.
Minimization of functions of several variables by steepest descent is subject
to all the problems associated with the simple iterative method of the previous
section. The problem of oscillation certainly occurs, but can be alleviated by the
addition of a momentum term. The modified update rule is then
$$\{\delta x\}_{i+1} = -\eta\{\nabla\}f(\{x\}_i) + \alpha\{\delta x\}_i. \qquad \text{(G.24)}$$
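A small numpy sketch of the vector rule (G.24) follows; the test function, step-size, momentum coefficient and starting point are illustrative choices only.

```python
import numpy as np

def steepest_descent(grad, x0, eta=0.02, alpha=0.5, n_steps=500):
    """Steepest descent with a momentum term, rule (G.24):
    {delta x}_{i+1} = -eta * grad f({x}_i) + alpha * {delta x}_i."""
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        delta = -eta * grad(x) + alpha * delta
        x = x + delta
    return x

# f(x, y) = x**2 + 25*y**2, a long narrow valley with its minimum at the origin.
grad_f = lambda x: np.array([2.0 * x[0], 50.0 * x[1]])
print(steepest_descent(grad_f, x0=[4.0, 2.0]))  # -> close to [0, 0]
```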
The problems presented by local minima are, if anything, more severe in
higher dimensions. An example of a troublesome function is given in figure G.5.
In addition to stalling in local minima, the iteration can be directed out to
infinity along valleys.

G.3 Training a neural network


The relevant tools have been developed and this section is concerned with deriving
a learning rule for training a multi-layer perceptron (MLP) network. The method of steepest descent is directly applicable; the function to be minimized is a measure of the network error in representing a desired input–output process.
Steepest-descent is used because there is no analytical relationship between the
network parameters and the prediction error of the network. However, at each
iteration, when an input signal is presented to the network, the error is known
because the desired outputs for a given input are assumed known. Steepest-
descent is therefore a method based on supervised learning. It will be shown later
that applying the steepest-descent algorithm results in update rules coinciding
with the back-propagation rules which were stated without proof in appendix E.
This establishes that back-propagation has a rigorous basis unlike some of the
more ad hoc learning schemes. The analysis here closely follows that of Billings
et al [37].
A short review of earlier material will be given first to re-establish the
appropriate notation. The MLP network neurons are assembled into layers and
only communicate with neurons in the adjacent layers; intra-layer connections
are forbidden (see figure E.13). Each node j in layer m is connected to each node i in the following layer m+1 by connections of weight w_{ij}^{(m+1)}. The network has l+1 layers, layer 0 being the input layer and layer l the output. Signals are passed through each node in layer m+1 as follows: a weighted sum is performed at i of all outputs x_j^{(m)} from the preceding layer; this gives the excitation z_i^{(m+1)} of the node

$$z_i^{(m+1)} = \sum_{j=0}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)} \qquad \text{(G.25)}$$
where n^{(m)} is the number of nodes in layer m. (The summation index starts from zero in order to accommodate the bias node.) The excitation signal is then passed through a nonlinear activation function f to emerge as the output x_i^{(m+1)} of the node to the next layer

$$x_i^{(m+1)} = f(z_i^{(m+1)}) = f\left(\sum_{j=0}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)}\right). \qquad \text{(G.26)}$$
Various choices for f are possible; in fact, the only restrictions on f are that it should be differentiable and monotonically increasing [219]. The hyperbolic tangent function f(x) = tanh(x) is used throughout this work, although the sigmoid f(x) = (1 + e^{−x})^{−1} is also very popular. The input layer nodes do not have nonlinear activation functions as their purpose is simply to distribute the network inputs to the nodes in the first hidden layer. The signals propagate only forward through the layers so the network is of the feedforward type.
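To make the notation concrete, here is a small Python sketch of the forward pass (G.25)–(G.26) for a network with one hidden layer and tanh activations, with the bias handled as the 0th node whose output is fixed at unity. The layer sizes, weights and input values are illustrative assumptions, not taken from the text.

```python
import numpy as np

def layer_forward(w, x_prev):
    """One layer of the forward pass, equations (G.25)-(G.26).
    w[i, j] is the weight w_ij; x_prev[0] = 1 is the bias node output."""
    z = w @ x_prev        # excitations z_i, equation (G.25)
    return np.tanh(z), z  # outputs x_i, equation (G.26)

# Illustrative sizes: 2 inputs, 3 hidden nodes, 1 output node.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 1 + 2))   # hidden-layer weights (column 0: bias)
w2 = rng.normal(size=(1, 1 + 3))   # output-layer weights (column 0: bias)

u = np.array([0.5, -1.2])                        # network inputs
x1, z1 = layer_forward(w1, np.concatenate(([1.0], u)))
y_hat, z2 = layer_forward(w2, np.concatenate(([1.0], x1)))
```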
An exception to the rule stated earlier, forbidding connections between layers which are not adjacent, is provided by the bias node which passes signals to all other nodes except those in the input layer. The output of the bias node is held constant at unity in order to allow constant offsets in the excitations. This is an alternative to associating a threshold θ_i^{(m)} with each node so that the excitation is calculated from

$$z_i^{(m+1)} = \sum_{j=1}^{n^{(m)}} w_{ij}^{(m+1)} x_j^{(m)} + \theta_i^{(m+1)}. \qquad \text{(G.27)}$$

The bias node is considered to be the 0th node in each layer.
As mentioned, training of the MLP requires sets of network inputs for which the desired network outputs are known. At each training step, a set of network inputs is passed forward through the layers, yielding finally a set of trial outputs ŷ_i, i = 1, ..., n^{(l)}. These are compared with the desired outputs y_i. If the comparison errors δ_i^{(l)} = y_i − ŷ_i are considered small enough, the network weights are not adjusted. However, if a significant error is obtained, the error is passed backwards through the layers and the weights are updated as the error signal propagates back through the connections. This is the source of the name back-propagation.

For each presentation of a training set, a measure J of the network error is evaluated where

$$J(t) = \frac{1}{2}\sum_{i=1}^{n^{(l)}} (y_i(t) - \hat{y}_i(t))^2 \qquad \text{(G.28)}$$

and J is implicitly a function of the network parameters, J = J(θ_1, ..., θ_n), where the θ_i are the connection weights ordered in some way. The integer t labels the presentation order of the training sets (the index t is suppressed in most of the following theory as a single presentation is considered). After a presentation of a training set, the steepest-descent algorithm requires an adjustment of the parameters

$$\Delta\theta_i = -\eta\frac{\partial J}{\partial\theta_i} = -\eta\nabla_i J \qquad \text{(G.29)}$$

where ∇_i is the gradient operator in the parameter space. As before, the learning coefficient η determines the step-size in the direction of steepest descent. Because only the errors for the output layer are known, it is necessary to construct effective errors for each of the hidden layers by propagating back the error from the output layer. For the output (lth) layer of the network an application of the chain rule of partial differentiation [233] yields
$$\frac{\partial J}{\partial w_{ij}^{(l)}} = \frac{\partial J}{\partial \hat{y}_i}\cdot\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}}. \qquad \text{(G.30)}$$

Now

$$\frac{\partial J}{\partial \hat{y}_i} = -(y_i - \hat{y}_i) = -\delta_i^{(l)} \qquad \text{(G.31)}$$

and as

$$\hat{y}_i = f\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right) \qquad \text{(G.32)}$$

a further application of the chain rule,

$$\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}} = \frac{\partial f}{\partial z_i^{(l)}}\frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} \qquad \text{(G.33)}$$

where z is defined as in (G.25), yields

$$\frac{\partial \hat{y}_i}{\partial w_{ij}^{(l)}} = f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)} = f'(z_i^{(l)})\,x_j^{(l-1)}. \qquad \text{(G.34)}$$
So substituting this equation and (G.31) into (G.30) gives

$$\frac{\partial J}{\partial w_{ij}^{(l)}} = -f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)}\,\delta_i^{(l)} \qquad \text{(G.35)}$$

and the update rule for connections to the output layer is obtained from (G.29) as

$$\Delta w_{ij}^{(l)} = \eta\,f'\left(\sum_{j=0}^{n^{(l-1)}} w_{ij}^{(l)} x_j^{(l-1)}\right)x_j^{(l-1)}\,\delta_i^{(l)} = \eta\,f'(z_i^{(l)})\,x_j^{(l-1)}\,\delta_i^{(l)} \qquad \text{(G.36)}$$

where

$$f'(z) = (1 + f(z))(1 - f(z)) \qquad \text{(G.37)}$$

if f is the hyperbolic tangent function, and

$$f'(z) = f(z)(1 - f(z)) \qquad \text{(G.38)}$$
if f is the sigmoid. Note that the whole optimization hinges critically on the fact that the transfer function f is differentiable. The existence of f' is crucial to the propagation of errors to the hidden layers and to their subsequent training. This is the reason why perceptrons could not have hidden layers and were consequently so limited. The use of discontinuous 'threshold' functions as transfer functions meant that hidden layers could not be trained.
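As an illustration of equations (G.31), (G.36) and (G.37), a small self-contained Python fragment for the output-layer update follows; the layer sizes, weight values and learning coefficient are arbitrary illustrative choices.

```python
import numpy as np

def tanh_prime(z):
    """f'(z) = (1 + f(z)) * (1 - f(z)) for f = tanh, equation (G.37)."""
    return (1.0 + np.tanh(z)) * (1.0 - np.tanh(z))

def output_layer_update(w_out, x_prev, y, eta=0.1):
    """Update the output-layer weights using (G.31) and (G.36).
    x_prev includes the bias node output (fixed at 1) as its 0th entry."""
    z = w_out @ x_prev        # excitations of the output nodes
    y_hat = np.tanh(z)        # trial outputs
    delta = y - y_hat         # output errors delta_i^(l), equation (G.31)
    # Delta w_ij^(l) = eta * f'(z_i) * x_j * delta_i, equation (G.36)
    return w_out + eta * np.outer(tanh_prime(z) * delta, x_prev)

# Illustrative example: one output node fed by a bias node plus three hidden outputs.
w_out = np.array([[0.1, -0.4, 0.2, 0.7]])
x_prev = np.array([1.0, 0.3, -0.8, 0.5])
w_out = output_layer_update(w_out, x_prev, y=np.array([0.25]))
```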
Updating of the parameters is essentially the same for the hidden layers except that an explicit error δ_i^{(m)} is not available. The errors for the hidden layer nodes must be constructed.

Considering the (l−1)th layer and applying the chain rule once more gives

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = \sum_{k=1}^{n^{(l)}}\frac{\partial J}{\partial \hat{y}_k}\frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}}\frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}}\frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}}. \qquad \text{(G.39)}$$

Now

$$\frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}} = f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)} \qquad \text{(G.40)}$$

$$\frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}} = f'(z_i^{(l-1)}) = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right) \qquad \text{(G.41)}$$

and

$$\frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}} = x_j^{(l-2)} \qquad \text{(G.42)}$$

so (G.39) becomes

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = -\sum_{k=1}^{n^{(l)}}\delta_k^{(l)}\,f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)}\,f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right)x_j^{(l-2)}. \qquad \text{(G.43)}$$

If the errors for the ith neuron of the (l−1)th layer are now defined as

$$\delta_i^{(l-1)} = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right)\sum_{k=1}^{n^{(l)}} f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right)w_{ki}^{(l)}\,\delta_k^{(l)} \qquad \text{(G.44)}$$

or

$$\delta_i^{(l-1)} = f'(z_i^{(l-1)})\sum_{k=1}^{n^{(l)}} f'(z_k^{(l)})\,w_{ki}^{(l)}\,\delta_k^{(l)} \qquad \text{(G.45)}$$

then equation (G.43) takes the simple form

$$\frac{\partial J}{\partial w_{ij}^{(l-1)}} = -\delta_i^{(l-1)} x_j^{(l-2)}. \qquad \text{(G.46)}$$
On carrying out this argument for all hidden layers m ∈ {l−1, l−2, ..., 1}, the general rules

$$\delta_i^{(m-1)}(t) = f'\left(\sum_{j=0}^{n^{(m-2)}} w_{ij}^{(m-1)}(t-1)\,x_j^{(m-2)}(t)\right)\sum_{k=1}^{n^{(m)}}\delta_k^{(m)}(t)\,w_{ki}^{(m)}(t-1) \qquad \text{(G.47)}$$

or

$$\delta_i^{(m-1)}(t) = f'\left(z_i^{(m-1)}(t)\right)\sum_{k=1}^{n^{(m)}}\delta_k^{(m)}(t)\,w_{ki}^{(m)}(t-1) \qquad \text{(G.48)}$$

and

$$\frac{\partial J}{\partial w_{ij}^{(m-1)}}(t) = -\delta_i^{(m-1)}(t)\,x_j^{(m-2)}(t) \qquad \text{(G.49)}$$

are obtained (on restoring the t index which labels the presentation of the training
set). Hence the name back-propagation.
Finally, the update rule for all the connection weights of the hidden layers
can be given as
$$w_{ij}^{(m)}(t) = w_{ij}^{(m)}(t-1) + \Delta w_{ij}^{(m)}(t) \qquad \text{(G.50)}$$

where

$$\Delta w_{ij}^{(m)}(t) = \eta\,\delta_i^{(m)}(t)\,x_j^{(m-1)}(t) \qquad \text{(G.51)}$$

for each presentation of a training set.


There is little guidance in the literature as to what the learning coefficient η should be; if it is taken too small, convergence to the correct parameters may take an extremely long time. However, if η is made large, learning is much more rapid but the parameters may diverge or oscillate in the fashion described in earlier sections. One way around this problem is to introduce a momentum term into the update rule as before:

$$\Delta w_{ij}^{(m)}(t) = \eta\,\delta_i^{(m)}(t)\,x_j^{(m-1)}(t) + \alpha\,\Delta w_{ij}^{(m)}(t-1) \qquad \text{(G.52)}$$

where α is the momentum coefficient. The additional term essentially damps out high-frequency variations in the error surface.
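Putting the pieces together, the following Python sketch carries out one back-propagation training step for a single-hidden-layer MLP with tanh activations, using (G.48), (G.51) and the momentum rule (G.52). It is a minimal illustration only: the network sizes, the learning and momentum coefficients, and the choice to fold the f′ factor of (G.36) into the output-layer delta (so that the single formula (G.51) serves both layers) are assumptions made here, not prescriptions from the text.

```python
import numpy as np

def tanh_prime(z):
    """f'(z) = (1 + f(z)) * (1 - f(z)) for f = tanh, equation (G.37)."""
    return (1.0 + np.tanh(z)) * (1.0 - np.tanh(z))

def train_step(w1, w2, dw1, dw2, u, y, eta=0.05, alpha=0.7):
    """One presentation of a training pattern (u, y) to a one-hidden-layer MLP.

    w1, w2   : hidden- and output-layer weight matrices (column 0 = bias weights)
    dw1, dw2 : previous weight increments, kept for the momentum term (G.52)
    Returns the updated weights and increments.
    """
    # Forward pass, equations (G.25)-(G.26).
    x0 = np.concatenate(([1.0], u))            # bias node plus network inputs
    z1 = w1 @ x0
    x1 = np.concatenate(([1.0], np.tanh(z1)))
    z2 = w2 @ x1
    y_hat = np.tanh(z2)

    # Output-layer error; the f'(z) factor of (G.36) is folded in here so that
    # the single update formula (G.51) applies to both layers.
    delta2 = tanh_prime(z2) * (y - y_hat)
    # Hidden-layer errors constructed by back-propagation, equation (G.48)
    # (the bias column of w2 carries no error backwards).
    delta1 = tanh_prime(z1) * (w2[:, 1:].T @ delta2)

    # Weight increments with momentum, equations (G.51)-(G.52).
    dw2 = eta * np.outer(delta2, x1) + alpha * dw2
    dw1 = eta * np.outer(delta1, x0) + alpha * dw1
    return w1 + dw1, w2 + dw2, dw1, dw2

# Illustrative use: a 1-4-1 network trained repeatedly on a single pattern.
rng = np.random.default_rng(1)
w1 = rng.normal(scale=0.5, size=(4, 2))   # 4 hidden nodes, bias + 1 input
w2 = rng.normal(scale=0.5, size=(1, 5))   # 1 output node, bias + 4 hidden nodes
dw1, dw2 = np.zeros_like(w1), np.zeros_like(w2)
for _ in range(200):
    w1, w2, dw1, dw2 = train_step(w1, w2, dw1, dw2,
                                  u=np.array([0.4]), y=np.array([0.8]))
```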
As usual with steepest-descent methods, back-propagation only guarantees
convergence to a local minimum of the error function. In fact the MLP is highly
nonlinear in the parameters and the error surface will consequently have many
minima. Various methods of overcoming this problem have been proposed; none has met with total success.
