Appendix G. Gradient descent and back-propagation
\[
\frac{df}{dx} < 0, \quad \text{if } x_0 < x_m \tag{G.1}
\]
\[
\frac{df}{dx} > 0, \quad \text{if } x_0 > x_m. \tag{G.2}
\]
[Figure G.1. The quadratic function $x^2$.]
G.1.1 Oscillation
Suppose that the function is $f(x) = (x - x_m)^2$. (This is not an unreasonable assumption, as Taylor's theorem shows that most functions are well approximated by a quadratic in the neighbourhood of a minimum.)
As mentioned earlier, if $\eta$ is too large, the iterate $x_{i+1}$ may be on the opposite side of the minimum to $x_i$ (figure G.2). A particularly ill-chosen value of $\eta$, $\eta_c$ say, leads to $x_{i+1}$ and $x_i$ being equidistant from $x_m$.
[Figure G.2. Iterates $x_0$ and $x_1$ falling on opposite sides of the minimum of $x^2$.]
In this case, the iterate will oscillate about the minimum ad infinitum as a result of the symmetry of the function. It could be argued that choosing $\eta = \eta_c$ would be extremely unlucky; however, any value of $\eta$ slightly smaller than $\eta_c$ will cause damped oscillations of the iterate about the point $x_m$. Such oscillations delay convergence, possibly substantially.
Fortunately, there is a solution to this problem. Note that the updates $\delta x_i$ and $\delta x_{i-1}$ will have opposite signs and similar magnitudes at the onset of oscillation. This means that they will cancel to a large extent, and updating at step $i$ with $\delta x_i + \delta x_{i-1}$ would provide a more stable iteration. If the iteration is not close to oscillation, the addition of the last-but-one update produces no qualitative difference. This circumstance leads to a modified update rule
\[
\delta x_i = -\eta \frac{df(x_i)}{dx} + \alpha\, \delta x_{i-1}. \tag{G.7}
\]
The new coefficient $\alpha$ is termed the momentum coefficient; a sensible choice of this can lead to much better convergence properties for the iteration.
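As an illustration, the following is a minimal sketch of the update rule (G.7) applied to $f(x) = (x - x_m)^2$; the function names and the particular values of $\eta$ and $\alpha$ are illustrative assumptions rather than prescriptions from the text.

```python
def minimise(grad, x, eta, alpha=0.0, steps=25):
    """Iterate delta_i = -eta * f'(x_i) + alpha * delta_{i-1}, equation (G.7)."""
    delta = 0.0
    for _ in range(steps):
        delta = -eta * grad(x) + alpha * delta
        x = x + delta
    return x

x_m = 1.0
grad = lambda x: 2.0 * (x - x_m)  # derivative of f(x) = (x - x_m)^2

print(minimise(grad, 4.0, eta=0.9))             # eta near eta_c: oscillatory, slow convergence
print(minimise(grad, 4.0, eta=0.9, alpha=0.5))  # momentum damps the oscillation
```

For this quadratic, $\eta_c = 1$; with $\eta = 0.9$ the plain iterate alternates sides of $x_m$ with its error shrinking by only a factor of $0.8$ per step, while the momentum term reduces this factor to roughly $0.71$ per step.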
Unfortunately, the next problem with the procedure is not dealt with so easily.
[Figure G.3. The function $x^4 + 2x^3 - 20x^2 + 20$, which has a local minimum at $x = 2.5$ in addition to the global minimum at $x = -4$.]
[Figure G.4. The function $x^2 + y^2$.]
Consider now the problem of minimizing a function of two variables, such as the simple function in figure G.4. The position of the minimum is now specified by a point in the $(x, y)$-plane. Any iterative procedure will require the update of both $x$ and $y$, so an analogue of equation (G.6) is required. The simplest generalization would be to update $x$ and $y$ separately using partial derivatives, e.g.,
\[
\delta x = -\eta \frac{\partial f}{\partial x} \tag{G.8}
\]
which would cause a decrease in the function by moving the iterate along a line of constant $y$, and
\[
\delta y = -\eta \frac{\partial f}{\partial y} \tag{G.9}
\]
which would achieve the same with movement along a line of constant $x$. In fact, this update rule proves to be an excellent choice. In vector notation, which shall be used for the remainder of this section, the coordinates are given by $\{x\} = (x_1, x_2)$ and the update rule is
\[
\{\delta x\} = (\delta x_1, \delta x_2) = -\eta\left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = -\eta \{\nabla\} f. \tag{G.10}
\]
[Figure G.5. An example of a function of two variables with troublesome local minima.]
\[
u_2 = \pm\frac{1}{|\{\nabla\} f(\{x\}_0)|} \frac{\partial f(\{x\}_0)}{\partial x_2} \tag{G.21}
\]
or
\[
\{u\} = \pm\frac{\{\nabla\} f(\{x\}_0)}{|\{\nabla\} f(\{x\}_0)|}. \tag{G.22}
\]
A consideration of the second derivatives reveals that the $+$ sign gives a vector in the direction of maximum increase of $f$, while the $-$ sign gives a vector in the direction of maximum decrease. This shows that the gradient descent rule
\[
\{\delta x\}_{i+1} = -\eta \{\nabla\} f(\{x\}_i) \tag{G.23}
\]
is actually the best possible. For this reason, the approach is most often referred
to as the method of steepest descent.
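As a small illustration of equation (G.22), the following sketch computes the unit vector of steepest descent for the paraboloid of figure G.4 (the function names here are illustrative assumptions):

```python
import numpy as np

def unit_descent_direction(grad_f, x0):
    """Unit vector of maximum decrease: equation (G.22) with the minus sign."""
    g = grad_f(np.asarray(x0, dtype=float))
    return -g / np.linalg.norm(g)

# For f(x1, x2) = x1^2 + x2^2 the gradient is (2 x1, 2 x2).
print(unit_descent_direction(lambda x: 2.0 * x, [3.0, 4.0]))  # [-0.6 -0.8]
```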
Minimization of functions of several variables by steepest descent is subject
to all the problems associated with the simple iterative method of the previous
section. The problem of oscillation certainly occurs, but can be alleviated by the
addition of a momentum term. The modified update rule is then
\[
\{\delta x\}_{i+1} = -\eta \{\nabla\} f(\{x\}_i) + \alpha \{\delta x\}_i. \tag{G.24}
\]
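A vector version of the earlier one-dimensional sketch, implementing (G.24) for the function of figure G.4; the step size and momentum values are again illustrative:

```python
import numpy as np

def steepest_descent(grad_f, x0, eta=0.4, alpha=0.5, steps=50):
    """Steepest descent with momentum, equation (G.24)."""
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta = -eta * grad_f(x) + alpha * delta
        x = x + delta
    return x

# f(x1, x2) = x1^2 + x2^2, with gradient (2 x1, 2 x2).
print(steepest_descent(lambda x: 2.0 * x, [3.0, -4.0]))  # approaches the minimum at (0, 0)
```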
The problems presented by local minima are, if anything, more severe in
higher dimensions. An example of a troublesome function is given in figure G.5.
In addition to stalling in local minima, the iteration can be directed out to
infinity along valleys.
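The stalling behaviour is easy to demonstrate even in one dimension, using the function of figure G.3. In the sketch below (the step size is an illustrative choice), the iterate converges to whichever minimum lies nearest its starting point:

```python
def descend(grad, x, eta=0.01, steps=200):
    """Plain gradient descent without momentum."""
    for _ in range(steps):
        x -= eta * grad(x)
    return x

# f(x) = x^4 + 2x^3 - 20x^2 + 20 has a local minimum at x = 2.5
# and its global minimum at x = -4.
grad = lambda x: 4.0 * x**3 + 6.0 * x**2 - 40.0 * x

print(descend(grad, x=1.0))   # stalls at the local minimum, x = 2.5
print(descend(grad, x=-1.0))  # reaches the global minimum, x = -4
```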
\[
\frac{\partial J}{\partial w_{ij}^{(l-1)}} = \sum_{k=1}^{n^{(l)}} \frac{\partial J}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}} \frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}} \frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}}. \tag{G.39}
\]
Now
\[
\frac{\partial \hat{y}_k}{\partial x_i^{(l-1)}} = f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right) w_{ki}^{(l)} \tag{G.40}
\]
\[
\frac{\partial x_i^{(l-1)}}{\partial z_i^{(l-1)}} = f'(z_i^{(l-1)}) = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right) \tag{G.41}
\]
and
\[
\frac{\partial z_i^{(l-1)}}{\partial w_{ij}^{(l-1)}} = x_j^{(l-2)} \tag{G.42}
\]
so (G.39) becomes
\[
\frac{\partial J}{\partial w_{ij}^{(l-1)}} = \sum_{k=1}^{n^{(l)}} \delta_k^{(l)} f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right) w_{ki}^{(l)} f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right) x_j^{(l-2)}. \tag{G.43}
\]
If the errors for the $i$th neuron of the $(l-1)$th layer are now defined as
\[
\delta_i^{(l-1)} = f'\left(\sum_{j=0}^{n^{(l-2)}} w_{ij}^{(l-1)} x_j^{(l-2)}\right) \sum_{k=1}^{n^{(l)}} f'\left(\sum_{j=0}^{n^{(l-1)}} w_{kj}^{(l)} x_j^{(l-1)}\right) w_{ki}^{(l)} \delta_k^{(l)} \tag{G.44}
\]
or
\[
\delta_i^{(l-1)} = f'(z_i^{(l-1)}) \sum_{k=1}^{n^{(l)}} f'(z_k^{(l)}) w_{ki}^{(l)} \delta_k^{(l)} \tag{G.45}
\]
then equation (G.43) takes the simple form
\[
\frac{\partial J}{\partial w_{ij}^{(l-1)}} = \delta_i^{(l-1)} x_j^{(l-2)}. \tag{G.46}
\]
On carrying out this argument for all the hidden layers $m = l-1, l-2, \ldots, 1$, the general rules
\[
\delta_i^{(m-1)}(t) = f'\left(\sum_{j=0}^{n^{(m-2)}} w_{ij}^{(m-1)} x_j^{(m-2)}(t)\right) \sum_{k=1}^{n^{(m)}} f'(z_k^{(m)}) w_{ki}^{(m)} \delta_k^{(m)}(t)
\]
\[
\frac{\partial J(t)}{\partial w_{ij}^{(m-1)}} = \delta_i^{(m-1)}(t)\, x_j^{(m-2)}(t)
\]
are obtained (on restoring the $t$ index which labels the presentation of the training set), so that the errors for each hidden layer are computed from those of the layer above. Hence the name back-propagation.
Finally, the update rule for all the connection weights of the hidden layers
can be given as
\[
w_{ij}^{(m)}(t) = w_{ij}^{(m)}(t-1) + \Delta w_{ij}^{(m)}(t) \tag{G.50}
\]
where
\[
\Delta w_{ij}^{(m)}(t) = -\eta\, \delta_i^{(m)}(t)\, x_j^{(m-1)}(t) + \alpha\, \Delta w_{ij}^{(m)}(t-1) \tag{G.51}
\]
where $\alpha$ is the momentum coefficient. The additional term essentially damps out high-frequency variations in the error surface.
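To make the procedure concrete, here is a minimal sketch of equations (G.45), (G.46), (G.50) and (G.51) for a network with one hidden layer. Bias terms (the $j = 0$ connections) are omitted for brevity, the cost is taken to be $J = \frac{1}{2}(y - \hat{y})^2$, and all names, sizes and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.tanh                                  # activation function
f_prime = lambda z: 1.0 - np.tanh(z) ** 2    # its derivative f'

# Weights of a 2-4-1 network (biases omitted).
w1 = rng.normal(scale=0.5, size=(4, 2))      # hidden-layer weights
w2 = rng.normal(scale=0.5, size=(1, 4))      # output-layer weights
dw1 = np.zeros_like(w1)                      # previous updates, kept for the
dw2 = np.zeros_like(w2)                      # momentum term in (G.51)

eta, alpha = 0.1, 0.5                        # learning rate and momentum coefficient

def train_step(x, y):
    global w1, w2, dw1, dw2
    # Forward pass: z is the weighted sum, x = f(z), layer by layer.
    z1 = w1 @ x;  x1 = f(z1)
    z2 = w2 @ x1; y_hat = f(z2)
    # Output-layer term: d2 plays the role of f'(z_k) delta_k in (G.45);
    # with J = (1/2)(y - y_hat)^2, dJ/dy_hat = y_hat - y.
    d2 = f_prime(z2) * (y_hat - y)
    # Back-propagate the errors to the hidden layer, as in (G.45).
    d1 = f_prime(z1) * (w2.T @ d2)
    # Gradients are outer products delta_i x_j, as in (G.46); update the
    # weights down the gradient with momentum, as in (G.50) and (G.51).
    dw2 = -eta * np.outer(d2, x1) + alpha * dw2
    dw1 = -eta * np.outer(d1, x) + alpha * dw1
    w2 += dw2; w1 += dw1

# Example usage: repeated presentations of the XOR training set.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
for _ in range(1000):
    for x, y in zip(X, Y):
        train_step(x, y)
```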
As usual with steepest-descent methods, back-propagation only guarantees convergence to a local minimum of the error function. In fact, the MLP is highly nonlinear in its parameters and the error surface will consequently have many minima. Various methods of overcoming this problem have been proposed; none has met with total success.