Neural Networks
Figure 1.1: Examples of linear and nonlinear modelling problems.
i n = m: Exact solution.
The Least Square and Minimum Norm solutions are discussed below.
∂E/∂w = 0. (1.3)
Using (1.2) in (1.3) gives
Example 1.1. The velocities of the left and right wheels of a mobile robot are given in Table I.
The two models for any of the wheel velocities are written as
v1 = w1 s + w0 and v2 = w2 s² + w1 s + w0 . (1.6)

Evaluating the w parameters for both models using (1.5) gives

ŷ1 = v̂l = [10.3 8.9 6.1 4.7]⊤ and ŷ2 = v̂r = [9.8 9.4 6.6 4.2]⊤ .
If n < m, i.e., there are more parameters than equations, then the solution is obtained as
Minimize ∥w∥² = w⊤w (1.9)
subject to Aw = y.
The constrained problem is handled through the Lagrangian

E = ½ w⊤w + λ (Aw − y). (1.10)

Using the constraint Aw = y together with the expression for w derived in (1.14) below gives

−AA⊤λ⊤ = y. (1.12)
• Minimization with respect to w:

∂E/∂w = 0. (1.13)
Using (1.10) in (1.13) gives
w⊤ + λ A = 0 ⇒ w = −A⊤ λ ⊤ (1.14)
Example 1.2. The data set consists of 50 samples from each of three species
of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features are mea-
sured from each sample: the length and the width of the sepals and petals,
in centimeters. Based on the combination of these four features, Fisher []
developed a linear discriminant model to distinguish the species from each
other.
Figure 1.3: The first three figures are three different species of Iris flower and the fourth one shows the 'petal' and 'sepal' of the flower.
Figure 1.4: The three different species based on UCI Iris data.

Solution. The data can be represented via a linear model using the expression

Aw = y. (1.16)
For this we need to load data from the required dataset and perform some
basic steps to obtain the model parameters and associated weights. The
code snippets in Python are given below.
i Loading data in Python:
ii Input feature and output label representation: After the iris data is imported in Python, the returned object is dict-like: data holds the input features and target holds the output labels.
code
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
print('keys in iris dictionary:', iris.keys())
X = iris['data']
print('First 3 entries of X:', X[:3])
Y = iris['target']
print('First 3 entries of Y:', Y[:3])
names = iris['target_names']
print('names:', names)
feature_names = iris['feature_names']
print('feature_names:', feature_names)
# Track a few sample points
isamples = np.random.randint(len(Y), size=(5))
print(isamples)
solution
keys in iris dictionary: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
First 3 entries of X: [[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]]
First 3 entries of Y: [0 0 0]
names: ['setosa' 'versicolor' 'virginica']
feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[9 125 15 64 113]
iii Shape of data: The shapes of the input and output arrays are inspected, along with the tracked samples.
code
print('Shape of X:', X.shape)
print('Shape of Y:', Y.shape)
print('X - samples:', X[isamples])
print('Y - samples:', Y[isamples])
solution
Shape of X: (150, 4)
Shape of Y: (150,)
X - samples: [[4.9 3.1 1.5 0.1]
 [7.2 3.2 6.0 1.8]
 [5.7 4.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [5.7 2.5 5.0 2.0]]
Y - samples: [0 2 0 1 2]
iv One-hot encoding of the labels: The integer class labels are converted to one-hot vectors.
code
from keras.utils import to_categorical
# Ny is the number of categories/classes
Ny = len(np.unique(Y))
print('Ny:', Ny)
# Convert to 1-hot encoding
Y = to_categorical(Y[:], num_classes=Ny)
print('X - samples:', X[isamples])
print('Y - samples:', Y[isamples])
solution
Ny: 3
X - samples: [[4.9 3.1 1.5 0.1]
 [7.2 3.2 6.0 1.8]
 [5.7 4.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [5.7 2.5 5.0 2.0]]
Y - samples: [[1 0 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
v Splitting the data: The data is split into training (80%) and test (20%) sets.
code
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
    test_size=0.20, random_state=1)
print('X_train shape:', X_train.shape)
print('X_test.shape:', X_test.shape)
solution
X_train shape: (120, 4)
X_test.shape: (30, 4)
vi Scaling the data (X): The features are scaled to zero mean and unit variance.
code
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Compute the mean and standard deviation
scaler.fit(X_train)
# Perform the transformation: x = (x - mean)/std
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# First 5 samples of X_train
print('X_train:', X_train[:5])
print('Y_train:', Y_train[:5])
solution
X_train: [[ 0.31553662 -0.04578885  0.44767531  0.23380268]
 [ 2.2449325  -0.04578885  1.29769171  1.39742892]
 [-0.2873996  -1.24028061  0.05100098 -0.15407273]
 [ 0.67729835 -0.52358555  1.01435291  1.13884531]
 [-0.04622511 -0.52358555  0.73101411  1.52672073]]
Y_train: [[0 1 0]
 [0 0 1]
 [0 1 0]
 [0 0 1]
 [0 0 1]]
ls bs lp bp y0 y1 y2
0.31 -0.04 0.45 0.23 0 1 0
2.24 -0.04 1.30 1.40 0 0 1
-0.29 -1.24 0.05 -0.15 0 1 0
0.68 -0.52 1.01 1.14 0 0 1
-0.05 -0.52 0.73 1.53 0 0 1
To find the solution, i.e., associated weights, (1.17) is written in the form
AW = Y, (1.18)
where A = [ls bs lp bp 1] and Y = [y0 y1 y2]. Solving (1.18) using the least square solution gives

W =
  0  −0.4543   0.0665
  0  −0.3726  −0.1308
  0   2.0741  −0.0692
  0  −1.2851   0.6930 .
Both multi-layered networks and radial basis function networks have feed-forward connection structures. When these
networks are represented as controllers or as system models, it becomes
relatively easy to perform closed-loop analysis of a control system in terms
of stability and convergence. Thus extensive use of feed-forward networks
in intelligent control system development in the existing literature is not
surprising.
Each neuron produces its response using a sigmoidal activation function. As mentioned earlier, MNNs have been successfully applied in solving some difficult problems by training them in a supervised manner with the error back-propagation algorithm. The back-propagation algorithm uses the principle of gradient descent to train the network parameters. The synopsis of the algorithm is given below.
Figure 1.5: A multi-layered feed-forward network with inputs x1, . . . , xm, outputs y1, . . . , yn, and layers 1 through L connected by the weights Wi1,i0, Wi2,i1, . . . , WiL,iL−1.
The network is trained to minimize the overall cost E = ∑p E_p, where E_p = ½(y_dp − y_p)²; y_p and y_dp denote the pth actual and desired patterns. Since y_p is a function of the network weight vector w, E can also be
expressed as a function of w. Consider a typical relationship between the
cost function E and the weight vector w, as shown in Figure 1.6, where the
cost function has only one global minimum. The principle of gradient descent tells us that to minimize the cost function E, the weight vector w should be updated using the following rule:

w_new = w_old − η (∂E/∂w)|w=w_old ,
where η is the learning rate. Figure 1.6 shows that E attains its minimum value at w = w_min. One should notice that when w is less than w_min, the slope ∂E/∂w is negative, thus the change in w is positive, which will move w towards w_min. Similarly, when w is greater than w_min, the change in w is negative, which makes w move to the left, i.e., towards w_min. In both cases, w will tend to w_min in a number of steps depending on the learning rate η.
Batch Update: When the weight vector w is updated such that ∆w = w_new − w_old is a function of the overall cost function E, i.e., ∆w = −η ∂E/∂w, the update rule is termed a batch update, which is an offline technique of weight update.
1.3.2 Derivation of Back Propagation Algorithm
The back propagation (BP) algorithm [Lipp:87, naren:91] offers an effec-
tive approach to the computation of the gradients. This can be applied to
any optimization formulation, i.e., any type of energy function, either max-
imization or minimization problem.
Let us consider a two layered network as shown in Figure 1.7 where i2 ,
i1 and i0 refer to output, hidden, and input layers, respectively.
The objective of this algorithm is to adjust the weights Wi2 i1 and Wi1 i0 in
order to minimize a cost function E which will train the network mapping
of the required function.
Figure 1.7: A two-layered network with input layer i0 (inputs x1, . . . , xm), hidden layer i1 (responses v1, . . . , vn1), and output layer i2 (outputs y1, . . . , yn2), connected by the weights Wi1,i0 and Wi2,i1.
Forward Phase
As shown in Figure 1.7, the input to the i1-th neuron of the hidden layer is given by

h_{i1} = ∑_{i0=1}^{m} W_{i1 i0} x_{i0} , (1.21)

and its response is

v_{i1} = Ψ(h_{i1}) = 1 / (1 + e^{−h_{i1}}) , (1.22)

while the response of the i2-th output unit is

y_{i2} = v_{i2} = Ψ(h_{i2}) = 1 / (1 + e^{−h_{i2}}) , (1.23)

where

h_{i2} = ∑_{i1=1}^{n1} W_{i2 i1} v_{i1} . (1.24)
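As a small illustration, the forward phase (1.21)-(1.24) can be sketched in NumPy for a two-layered network; the layer sizes and random weights below are hypothetical.

import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Hypothetical sizes: m inputs, n1 hidden units, n2 output units
m, n1, n2 = 4, 15, 3
rng = np.random.default_rng(0)
W10 = rng.normal(scale=0.1, size=(n1, m))    # weights W_{i1,i0}
W21 = rng.normal(scale=0.1, size=(n2, n1))   # weights W_{i2,i1}

x = rng.uniform(size=m)        # an input pattern
v = sigmoid(W10 @ x)           # hidden responses, eqs. (1.21)-(1.22)
y = sigmoid(W21 @ v)           # output responses, eqs. (1.23)-(1.24)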
Back-Propagation of Error

The instantaneous cost at time t is E(t) = ∑_{i2} E_{i2}(t), where

E_{i2}(t) = ½ (y_{di2}(t) − y_{i2}(t))² (1.26)

and y_{di2} is the desired response of the i2-th unit of the output layer.
Update of the weights connecting the output layer and the hidden layer

W_{i2 i1}(t + 1) = W_{i2 i1}(t) − η ∂E(t)/∂W_{i2 i1}(t), (1.27)

where η is the learning rate. The derivative ∂E/∂W_{i2 i1} is computed as follows:

∂E(t)/∂W_{i2 i1}(t) = ∑_{i2=1}^{n2} [∂E_{i2}(t)/∂y_{i2}] × [∂y_{i2}/∂W_{i2 i1}(t)], (1.28)

∂E_{i2}(t)/∂y_{i2} = −½ × 2(y_{di2} − y_{i2}) = −(y_{di2} − y_{i2}). (1.29)
The following relations are obtained using equations (1.23) and (1.24), respectively:

∂y_{i2}/∂h_{i2} = ∂/∂h_{i2} [1/(1 + e^{−h_{i2}})] = y_{i2}(1 − y_{i2}) and (1.31)

∂h_{i2}/∂W_{i2 i1} = v_{i1}. (1.32)
The gradient (1.28) is obtained using equations (1.29), (1.31), and (1.32) as
∂E(t)/∂W_{i2 i1}(t) = −(y_{di2} − y_{i2}) y_{i2}(1 − y_{i2}) v_{i1} = −δ_{i2} v_{i1}, (1.33)

where δ_{i2} = y_{i2}(1 − y_{i2})(y_{di2} − y_{i2}) is the error back-propagated from the output layer.
Update of the weights connecting the hidden layer and the input layer
Unlike a neuron in the output layer, a neuron in the hidden layer has no
specified desired response. Thus the output error has to be back-propagated
so that weights connecting to hidden layer from the input layer can be up-
dated.
As usual, the update of the weight using the gradient descent algorithm is written as

W_{i1 i0}(t + 1) = W_{i1 i0}(t) − η ∂E(t)/∂W_{i1 i0}(t). (1.35)

The derivative term ∂E/∂W_{i1 i0} is computed as

∂E(t)/∂W_{i1 i0}(t) = ∑_{i2=1}^{n2} [∂E_{i2}(t)/∂y_{i2}] × [∂y_{i2}/∂W_{i1 i0}(t)], (1.36)
where the first term is computed using equation (1.29), and the second term ∂y_{i2}/∂W_{i1 i0}(t) is expanded through the hidden layer: its first factor is computed using equations (1.23) and (1.24), while its second factor is computed using equations (1.21) and (1.22). Thus the final computable expression for the gradient term (1.36) is obtained using equations (1.29), (1.38) and (1.39):
∂E/∂W_{i1 i0} = − ∑_{i2=1}^{n2} (y_{di2} − y_{i2}) y_{i2}(1 − y_{i2}) W_{i2 i1} v_{i1}(1 − v_{i1}) x_{i0}
            = −v_{i1}(1 − v_{i1}) x_{i0} ∑_{i2=1}^{n2} δ_{i2} W_{i2 i1} .
Thus the update law (1.35) has the final computable form

W_{ik ik−1}(t + 1) = W_{ik ik−1}(t) + η δ_{ik} v_{ik−1} , (1.41)

where v_{ik−1} is the output of the i-th neuron of layer k − 1. The term δ_{ik} for each layer is expressed as

δ_{iL} = (y_{di} − y_{iL}) y_{iL}(1 − y_{iL}) for the output layer L, and (1.42)

δ_{il} = v_{il}(1 − v_{il}) ∑_{il+1=1}^{n_{l+1}} δ_{il+1} W_{il+1 il} for the other hidden layers. (1.43)
All the weights can be updated in a similar manner. Thus for example, if
one considers a four layered network, the weight update rule for 4th layer
or the output layer will be:
W_{i4 i3}(t + 1) = W_{i4 i3}(t) + η δ_{i4} v_{i3} , with δ_{i4} = (y_{di4} − y_{i4}) y_{i4}(1 − y_{i4}),

where y_{i4} is the i4-th unit of the output layer and y_{di4} is the corresponding desired output. The weight update law for the 3rd layer is derived in the same fashion; the influences from an upper layer to a lower layer and vice-versa can only be effected via the error signals of the intermediate layer.
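The complete update laws (1.27)-(1.43) for a two-layered sigmoidal network can be collected into a short sketch; the helper below is illustrative, with the learning rate as an assumption.

import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def bp_step(x, yd, W10, W21, eta=0.5):
    # Forward phase, eqs. (1.21)-(1.24)
    v = sigmoid(W10 @ x)                 # hidden responses
    y = sigmoid(W21 @ v)                 # output responses
    # Output-layer error term, eq. (1.33)
    d2 = (yd - y) * y * (1.0 - y)
    # Hidden-layer error term, eq. (1.43)
    d1 = v * (1.0 - v) * (W21.T @ d2)
    # Gradient-descent updates, eqs. (1.27) and (1.35)
    W21 += eta * np.outer(d2, v)
    W10 += eta * np.outer(d1, x)
    return W10, W21, 0.5 * np.sum((yd - y) ** 2)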
Example 1.3. Let us consider the following discrete time dynamical sys-
tem.
y(k + 1) = y(k) / (1 + y²(k)) + u³(k) (1.44)
Identify the system using a multi-layered network.
Solution
Since a feed-forward network is a static network, when a dynamical system
is identified using this kind of network, the previous state is also an input to
the network along with the original system input. The system being a stable
one, input is generated randomly between ±1. 1000 data pairs are gener-
ated with this input range using equation (1.44). An MLN is considered
with one input layer, one hidden layer and one output layer. The number
of neurons in the hidden layer is taken as 15 and sigmoid function is taken
as the activation function. An instantaneous update using back propaga-
tion algorithm is employed to train the network. The training results are
shown in Figure 1.8 where the left figure shows the input-output relation-
ship of the training data as well as the network output after training. Right
figure shows the desired and actual output of the network at every sampling
instant. The convergence of mean square error over 1500 epochs is shown
in Figure 1.9 (left). Once the training is over, 1000 new data points are gen-
erated to validate the identified model while taking the input to the system
as u(k) = sin(0.1 k). The result of model validation is shown in Figure 1.9
(right). The RMS error for the training data set is found to 0.022 while for
the testing data set it is 0.034. The MATLAB code for the above example is
given as follows.
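In place of the original MATLAB listing, a Python sketch of the data generation and training loop follows; it reuses the hypothetical bp_step helper from the back-propagation sketch above, and the learning rate is an assumption.

import numpy as np
rng = np.random.default_rng(0)

# 1000 training pairs from eq. (1.44) with random input in [-1, 1]
N = 1000
u = rng.uniform(-1.0, 1.0, size=N)
y = np.zeros(N + 1)
for k in range(N):
    y[k + 1] = y[k] / (1.0 + y[k] ** 2) + u[k] ** 3

X  = np.stack([u, y[:-1]], axis=1)   # network inputs: u(k) and y(k)
Yd = y[1:]                           # desired output: y(k+1)
# In practice the targets would be scaled into (0, 1) for a sigmoidal output.

W10 = rng.normal(scale=0.1, size=(15, 2))   # 15 hidden neurons
W21 = rng.normal(scale=0.1, size=(1, 15))
for epoch in range(1500):
    for k in range(N):               # instantaneous (pattern-wise) update
        W10, W21, e = bp_step(X[k], np.array([Yd[k]]), W10, W21, eta=0.1)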
Figure 1.8: Example 1.3, left: input-output relationship of the training data and the trained network; right: desired and actual outputs of the network at each sampling instant.
Figure 1.9: Example 1.3- left: Epoch wise convergence of mean square error
and right: Model validation using the test data.
1.3.4 Convergence of BP Learning Algorithm
The weight update law in back propagation is derived using the gradient
descent algorithm where a cost function E has to be minimized with respect
to the weight vector w. The gradient-descent algorithm ensures that the first
order approximation of E(w) reaches its global minimum but it does not
ensure global convergence of the original cost function E(w). The weight
update law is given as

w(t + 1) = w(t) − η ∇E,

where η > 0 is the learning rate and ∇E is the gradient of the cost function E. The learning rate η determines the speed of convergence, and convergence depends on a proper choice of initial conditions. Let us consider minimization of the following function: E(w) = 0.5w² − 8 sin w + 7. It is a multi-modal function with two local minima and a global minimum.
The gradient-descent update law for the parameter w is obtained as

∆w = −η ∂E/∂w = −η (w − 8 cos w).
Figure 1.10 shows the convergence of the cost function with respect to the weight w for η = 0.01 and η = 0.4, starting from the initial weight w0 = −8.

Figure 1.10: Convergence of gradient descent on E(w) = 0.5w² − 8 sin w + 7 for η = 0.01 and η = 0.4, with w0 = −8.

It can be seen from the figure that the iteration converges for the small step size but may
not converge for larger step size. Moreover when the step size is large, it
zigzags its way about the true direction to the global minimum thus leading
to a slow convergence. There is no comprehensive method to select initial
weight w0 and learning rate η for a given problem so that the optimal weight
vector can be reached.
The problem arises from error surfaces with steep sides and shallow valley
floor. One simple but effective way to deal with this problem is addition of
a momentum factor. This allows us to safely increase the learning rate and
to avoid long flat surfaces.
Adding a momentum factor simply means that one is giving a momen-
tum to the learning, i.e. learning rate is made faster. If α is the momentum
rate, then adding a momentum leads to the following change in the weight
update law.
w(t + 1) = w(t) − η ∂E/∂w(t) + α [w(t) − w(t − 1)].

Writing ∆w = w(t + 1) − w(t), and assuming w(t) − w(t − 1) ≈ ∆w in a flat region,

∆w = −η ∂E/∂w(t) + α ∆w
(1 − α) ∆w = −η ∂E/∂w(t)
∆w = −(η / (1 − α)) × ∂E/∂w(t). (1.45)
From the above equation one can see that the learning rate is increased by a factor of 1/(1 − α). Thus the plateau region can be crossed at a much faster rate.
A radial basis function network (RBFN) consists of three layers:
• An input layer
• A hidden layer
• An output layer
The hidden units provide a set of functions that constitute an arbitrary basis
for the input patterns.
1. Hidden units are known as radial centers. Each radial center is rep-
resented by a vector ci , i = 1, . . . , h, where h is the number of radial
centers in the hidden layer.
Figure 1.11: Architecture of a radial basis function network with inputs x1, . . . , xp, radial centers c1, . . . , ch producing responses φ1, . . . , φh, and output weights w1, . . . , wh.
Suppose an input vector lies close to a radial center ci. This would activate the hidden center ci and by proper choice of
weights the target output is obtained. Suppose an input vector lies between
two receptive field centers, then both those hidden units will be apprecia-
bly activated. The output will be a weighted average of the corresponding
targets. In short the inputs are clustered around the centers and the output
is linear in terms of weights, wi .
y = ∑_{i=1}^{h} φi wi (1.46)
The response of the i-th radial center in an RBFN is usually expressed as
ϕi = ϕ (||x − ci ||) (1.47)
where ||x − ci || is the Euclidean distance between x and ci . ϕ (.) is a radial
basis function. When this basis function is Gaussian, each node produces
an identical output for inputs within a fixed radial distance from the center,
i.e., they are radially symmetric. This accounts for the nomenclature 'radial basis function'. Other radial activation functions, such as the inverse quadratic and thin-plate spline functions, are also used (see Exercises 9 and 10).
Pseudo-Inverse Technique
σ = d / √(2h) (1.48)
where d is the spread of the centers, i.e. the maximum distance between
two centers, and h is the number of centers.
Such a choice of σ ensures that the basis functions are neither too peaked nor too flat; moreover, the performance of the network is not very sensitive to the precise value of σ.
Once the width of the basis function is fixed, the next task is to learn the
weights. For a given input-output pattern (x p , y p ), the output in terms of
weight vector w can be written as
y p = ϕ pT w (1.49)
Figure 1.12: An RBF network for the EX-NOR problem, with inputs x1, x2, radial centers c1, c2 producing φ1, φ2, weights w1, w2, and a bias input of +1 with weight θ.
where the i-th element of φp is computed as

φip = exp( −(h/d²) ‖xp − ci‖² ).

Collecting such expressions over all patterns leads to the following least square problem:

Φw = y, (1.50)

where the p-th row of Φ is φp⊤ and y = [y1, . . . , yp, . . .]⊤. The weight vector w can be solved using the following pseudo-inverse:

w = Φ†y = (Φ⊤Φ)⁻¹Φ⊤y. (1.51)
Example 1.4. Consider the RBF network as shown in Figure 1.12 to solve
an EX-NOR problem:
x1 x2 y
0 0 1
0 1 0
1 0 0
1 1 1
The output of the RBF network is given by
y = w1 ϕ1 + w2 ϕ2 + θ
Assuming Gaussian radial centers, φ(x) = e^{−x²}, find the weights w1 and w2 associated with the two inputs of the network, and θ, the weight associated with the output bias.
Solution
Applying the four training patterns one after another following equations
are obtained:
w1 + w2 e−2 + θ = 1
w1 e−1 + w2 e−1 + θ = 0
w1 e−1 + w2 e−1 + θ = 0
w1 e−2 + w2 + θ = 1
Thus there are 4 equations to be solved for 3 unknowns w1 , w2 , θ . In this
case,

Φ =
  1       0.1353  1
  0.3679  0.3679  1
  0.3679  0.3679  1
  0.1353  1       1 ,   w = [w1, w2, θ]⊤,   y = [1, 0, 0, 1]⊤.
Using equation (1.51), one gets the desired weights as w = [2.5031, 2.5031, −1.848]⊤.
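The solution can be checked numerically; a short NumPy sketch of the pseudo-inverse computation (1.51) for the EX-NOR patterns follows, with centers c1 = (0, 0) and c2 = (1, 1).

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
phi1 = np.exp(-np.sum((X - c1) ** 2, axis=1))   # Gaussian responses
phi2 = np.exp(-np.sum((X - c2) ** 2, axis=1))

Phi = np.column_stack([phi1, phi2, np.ones(4)]) # bias column for theta
y = np.array([1.0, 0.0, 0.0, 1.0])              # EX-NOR targets
w = np.linalg.pinv(Phi) @ y                     # eq. (1.51)
print(w)   # approximately [2.5031, 2.5031, -1.848]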
One of the most natural approaches to update centers and weights is super-
vised training by error correction learning. This is easily implemented by
using a gradient descent procedure. The update rule for center learning is
given below.
c_{ij}(t + 1) = c_{ij}(t) − η1 ∂E(t)/∂c_{ij}(t)
for i = 1 to p and j = 1 to M.
Similarly the update for weights are given as
w_i(t + 1) = w_i(t) − η2 ∂E(t)/∂w_i(t),
where the cost function is E = ½ ∑ (yd − y)².
The actual response of the RBF network shown in Figure 1.11 is computed as

y = ∑_{i=1}^{h} φi wi ,   φi = e^{−zi²/(2σ²)},   zi = ‖x − ci‖.

Differentiating E with respect to wi gives

∂E/∂wi = ∂E/∂y × ∂y/∂wi = −(yd − y) φi . (1.52)
Differentiating E with respect to c_{ij} yields:

∂E/∂c_{ij} = ∂E/∂y × ∂y/∂φi × ∂φi/∂c_{ij} (1.53)
          = −(yd − y) × wi × ∂φi/∂zi × ∂zi/∂c_{ij} , (1.54)

∂φi/∂zi = −(zi/σ²) φi , (1.55)

∂zi/∂c_{ij} = ∂/∂c_{ij} ( ∑_j (x_j − c_{ij})² )^{1/2} (1.56)
           = −(x_j − c_{ij})/zi . (1.57)
The gradient descent vector ∂ E/∂ ci j exhibits a clustering effect. Note that
for RBF network, there is no back-propagation of error unlike the super-
vised learning in multi-layered network.
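One supervised training step over the center and weight update rules (1.52)-(1.57) can be sketched as follows; the learning rates are hypothetical.

import numpy as np

def rbf_step(x, yd, C, w, sigma, eta1=0.01, eta2=0.05):
    # Responses phi_i = exp(-z_i^2 / (2 sigma^2)), z_i = ||x - c_i||
    z = np.linalg.norm(x - C, axis=1)
    phi = np.exp(-z ** 2 / (2.0 * sigma ** 2))
    e = yd - phi @ w                 # output error
    # Weight update from eq. (1.52)
    w += eta2 * e * phi
    # Center update from eqs. (1.53)-(1.57):
    # dE/dc_ij = -(yd - y) w_i phi_i (x_j - c_ij) / sigma^2
    C += eta1 * e * (w * phi / sigma ** 2)[:, None] * (x - C)
    return C, w, e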
Figure 1.13: Example 1.5: (a) input-output relationship of the training data and the trained network; (b) desired and actual outputs of the network at each sampling instant.
Figure 1.14: Example 1.5: Epoch wise convergence of mean square error
Figure 1.15: Example 1.5: Model validation using the test data
% Generate validation data for the identified model
d(1)=0.5; p=zeros(2,1000);
for k=1:1000
    u(k) = sin(0.1*k);                              % test input
    d(k+1) = d(k)/(1+d(k)*d(k)) + u(k)*u(k)*u(k);   % plant dynamics
    p(1,k)=u(k); p(2,k)=d(k); t(k)=d(k+1);          % network inputs and target
end
% Simulate the trained network and plot desired vs. actual output
y = sim(net,p);
figure
plot(t);
hold on;
plot(y,'r')
Hybrid Learning
In hybrid learning, the radial basis functions relocate their centers in a self
organized manner while the weights of the output layers are computed us-
ing supervised learning rule. Hence the name ’hybrid learning’. Due to
self-organized learning the radial centers are placed in those regions where
significant input data is present. When a pattern is presented to the RBF
network during training, either a new center is grown if the pattern is suf-
ficiently novel or the network parameters in both layers are updated using
gradient descent. The test for novelty relies on two criteria:
1. Is the Euclidean distance between the input pattern and the nearest
existing center greater than a threshold δ (t) ?
2. Is the mean square error at the output greater than a desired accuracy
ε?
A new center is allocated when both of the novelty criteria are satisfied.
The K-means clustering technique is employed for the self-organized selection of centers. It dynamically updates the cluster centers at each iteration until they reach a stable state. Each center moves to the densest part of its cluster, and hence the intra-cluster distance is minimized. The input vector x is classified according to the most frequently occurring label among those of the K nearest samples. The procedure of center update is as follows:
Find the center that is closest to x in terms of the Euclidean distance. This particular center is updated according to the following rule:

ci(t + 1) = ci(t) + α (x − ci(t)).

Thus the center moves closer to x. This rule effectively minimizes the distortion ∫ ‖x − ci‖² P(x) dx, where P(x) is the probability density of x; the rate α ∈ (0, 1) is gradually decreased to 0.
While centers are updated using unsupervised scheme, the weights can
be updated using any of the least square algorithms like gradient descent
(GD) or recursive least squares (RLS). The RLS [good:91] algorithm is pre-
sented below.
The i-th output of the RBFN described earlier can be written as

xi = φ⊤ θi , i = 1, . . . , n.

The parameter vector θi is updated recursively as

θi(k + 1) = θi(k) + P(k + 1) φ(k) [xi(k) − φ(k)⊤ θi(k)], (1.60a)
P(k + 1) = P(k) − P(k)φ(k)[1 + φ(k)⊤P(k)φ(k)]⁻¹ φ(k)⊤P(k), (1.60b)
where P(k) ∈ R^{l×l} is known as the covariance matrix. RLS is more accurate and converges faster than the least mean square (LMS) [simon:aft] algorithm.
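A sketch of one RLS step implementing (1.60a)-(1.60b), assuming the vector phi holds the radial responses for the current pattern:

import numpy as np

def rls_step(theta, P, phi, x_target):
    # Gain and covariance update, eq. (1.60b)
    Pphi = P @ phi
    k = Pphi / (1.0 + phi @ Pphi)       # gain vector P(k+1) phi(k)
    P -= np.outer(k, Pphi)
    # Parameter update, eq. (1.60a)
    theta += k * (x_target - phi @ theta)
    return theta, P

l = 10                                   # number of radial units (assumed)
theta, P = np.zeros(l), 1e3 * np.eye(l)  # large initial covariance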
The simplest approach is to update the centers using gradient descent
algorithm and the weights can be updated using simple LMS algorithm.
Although computational requirement increases by adjusting centers, the
number of centers can be substantially reduced by this approach. The gen-
eralization performance of such a network is much better than that of the hybrid learning scheme, where the centers are fixed or learned unsupervised and the weights are updated using the recursive least squares algorithm.
Figure 1.16: Distribution of the data points and the learned RBF centers in the input space (Input 1: u(k); Input 2: y(k)).
The RMS error is 0.018 for the training data and 0.027 for the testing data.
1.5 Adaptive Learning Rate
Faster convergence and function approximation accuracy are two key is-
sues in choosing a training algorithm. The popular method for training
a feed-forward network has been the back propagation. One of the main
drawbacks of back propagation algorithm is its slow rate of convergence
and its inability to ensure global convergence. Some heuristic methods like
adding a momentum term to original BP algorithm and standard numerical
optimization techniques using quasi-Newton methods have been proposed
to improve the convergence rate of BP algorithm [haykin:99, sarkar:95].
The problem with quasi-Newton methods is that the storage and memory requirements go up as the square of the size of the network. Nonlinear optimization techniques such as the Newton method and conjugate gradient [Chara:92, osowski:96] have also been used for training. Though these algorithms converge in fewer iterations than the BP algorithm, they require too much
computation per pattern. Other algorithms for faster convergence include
extended Kalman filtering (EKF) [Iguni:92], recursive least square (RLS)
[Bilski:98] and Levenberg-Marquardt (LM) [Hagan:94, Lera:02]. In order
to overcome the computational complexity in these algorithms, a number
of improvements have also been suggested [Okyay:01, Toledo:05]. How-
ever, these improvements do not bring them closer to back-propagation al-
gorithm as far as simplicity and ease of implementation is concerned.
In a recent work [swagat:06], two novel algorithms have been proposed for the adaptive learning rate using Lyapunov stability theory. The proposed algorithm has an exact parallel with the popular BP algorithm, where the fixed
learning rate in BP algorithm is replaced by an adaptive learning rate. It
is observed that this adaptive learning rate increases the speed of conver-
gence. Although Yu et al. [Yu:02] and Yu et al. [poznyak:01] have used Lyapunov function based weight update algorithms, neither of them addresses the issue of computing an adaptive learning rate in a formal manner. The
nature of convergence has also not been discussed in their work.
It is shown in section 1.3.5 that a momentum term may be added to
BP algorithm in order to speed up its convergence rate. Such a term arises
naturally in Lyapunov function based algorithms and thus it provides a the-
oretical justification for such kind of modifications.
Consider a feed-forward network with weight vector w. If the pth input vector is x_p, then the network output is given by
y p = f (w, x p ) p = 1, 2, . . . N. (1.61)
The usual quadratic cost function which is minimized to train the weight
vector w is given by
E = ½ ∑_{p=1}^{N} (y_dp − y_p)². (1.62)
In order to derive a weight update algorithm for such a network, we con-
sider a Lyapunov function candidate as
V1 = ½ (ỹ⊤ỹ), (1.63)

where ỹ = [y_d1 − y_1, . . . , y_dp − y_p, . . . , y_dN − y_N]⊤. As can be seen, in this case the time derivative of V1 becomes

V̇1 = −ỹ⊤ (∂y/∂w) ẇ = −ỹ⊤ J ẇ, (1.64)

where

J = ∂y/∂w ∈ R^{N×m}. (1.65)
Theorem 1.1. If an arbitrary initial weight w(0) is updated by
w(t′) = w(0) + ∫₀^{t′} ẇ dt, (1.66)

where

ẇ = ( ‖ỹ‖² / ‖J⊤ỹ‖² ) J⊤ỹ, (1.67)
then ỹ converges to zero under the condition that ẇ exists along the conver-
gence trajectory.
Proof. Substituting (1.67) into (1.64) gives

V̇1 = −‖ỹ‖² ≤ 0, (1.68)

where V̇1 < 0 for all ỹ ≠ 0. If V̇1 is uniformly continuous and bounded, then according to Barbalat's lemma [krstic:02], as t → ∞, V̇1 → 0 and ỹ → 0.
The weight update law given in equation (1.67) is a batch update law. Analogous to the instantaneous gradient descent (or BP) algorithm, the instantaneous LF I learning algorithm can be derived as

ẇ = ( ‖ỹ‖² / ‖Jp⊤ỹ‖² ) Jp⊤ ỹ, (1.69)

which corresponds to an adaptive learning rate

ηa = μ ‖ỹ‖² / ‖Jp⊤ỹ‖². (1.73)
This is the most remarkable finding of the work [swagat:06]. Earlier, there
have been many research papers concerning the adaptive learning rate [haykin:99,
Qiu:92, cheng:93, tol:90]. However, the computation of adaptive learning
rate using Lyapunov function approach is a key contribution in this field.
Theorem 1.1 states that the global convergence of the learning algorithm (1.67) is guaranteed provided ẇ exists and is non-zero along the convergence trajectory. This, in turn, necessitates ‖∂V1/∂w‖ = ‖J⊤ỹ‖ ≠ 0. The condition ‖∂V1/∂w‖ = 0 represents a local minimum of the scalar function (1.63). Thus, Theorem 1.1 says that the global minimum is reached only when local minima are avoided during training. Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.
When the algorithm is implemented to learn an XOR map, Figure 1.17 shows how the learning rate naturally evolves and converges to zero as the learning completes. Readers are referred to the work [swagat:06] for more details.
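One step of the instantaneous LF I update (1.69) can be sketched as below; the network-specific Jacobian Jp is assumed to be supplied by the network implementation.

import numpy as np

def lf1_step(w, y_err, Jp, mu=0.9):
    # Adaptive learning rate, eq. (1.73): eta = mu ||e||^2 / ||Jp^T e||^2
    g = Jp.T @ y_err                              # direction Jp^T y~
    eta = mu * (y_err @ y_err) / (g @ g + 1e-12)  # small guard term added
    return w + eta * g                            # update along eq. (1.69)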
Figure 1.17: Evolution of the adaptive learning rate of the LF I algorithm while learning the XOR map.
Figure 1.18: (a) Partially Connected Recurrent Networks (b) Fully Con-
nected Recurrent Networks
Consider a system expressed in discrete dynamics:

y(t) = f(y(t − 1)) + g(y(t − 1)) u(t − 1), (1.74)

where y(t) may be a scalar variable or a vector, f(·) and g(·) are two arbitrary
nonlinear functions of y(t − 1). The feed-forward network will learn this
dynamics as
ŷ(t) = fˆ(y(t − 1), u(t − 1)) (1.75)
where y(t − 1) is the actual system output at the t − 1 instant, whereas the recurrent network learns this map as

ŷ(t) = f̂(ŷ(t − 1), u(t − 1)), (1.76)
which indicates that the network has internal memory, i.e. the present out-
put state is a function of previous output of the network. Figure 1.19 sum-
marizes the difference between feed-forward network and feedback net-
work.
s_i(t + 1) = ∑_j w_{ij} x_j(t),
where t represents the temporal argument. The response of the above unit can be expressed as

x4(t + 1) = f(s4(t + 1)),   s4(t + 1) = ∑_{i=1}^{5} w_{4i} x_i(t).
The response of the output unit (5th unit) can now be given as

y(t + 1) = x5(t + 1) = f( ∑_{i=1}^{5} w_{5i} x_i(t) ).
Figure 1.20: A simple recurrent network with input x(t), input weight w, feedback weight g, and output y(t + 1).
Example 1.7. Unfold the fully recurrent network given in Figure 1.18 (b)
in time.
Solution
The unfolded network until time-step t = 4 is shown in Figure 1.21.
Figure 1.21: The fully recurrent network of Figure 1.18(b) unfolded in time from t = 0 to t = 4.
Example 1.8. Enumerate steps for training the unfolded network of the
simple recurrent network given in Figure 1.20.
Solution
The steps are as follows:
1. Compute the response of the sequence y(1) to y(5), given the sequence x(0) to x(4) and y(0).
2. Compute the error at the last node, e(5) = yd(5) − y(5), and the corresponding δ5 = y(5)(1 − y(5)) e(5).
3. Compute e(4), e(3), e(2) and e(1), and proceed the same way until the first node t = 1:
∆w4 = ηδ4 x(3)
∆g4 = ηδ4 y(3)
where δ4 = y(4)(1 − y(4))[δ5 g + e(4)]. Similarly,
∆w3 = ηδ3 x(2)
∆g3 = ηδ3 y(2), δ3 = y(3)(1 − y(3))[δ4 g + e(3)]
∆w2 = ηδ2 x(1)
∆g2 = ηδ2 y(1), δ2 = y(2)(1 − y(2))[δ3 g + e(2)]
∆w1 = ηδ1 x(0)
∆g1 = ηδ1 y(0), δ1 = y(1)(1 − y(1))[δ2 g + e(1)]
The accumulated updates are then applied as

w_new = w_old + ∑_{i=1}^{5} ∆wi ,
g_new = g_old + ∑_{i=1}^{5} ∆gi .
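These steps can be collected into a BPTT sketch for the simple recurrent network y(t + 1) = f(w x(t) + g y(t)); the learning rate is as in the example, and the sigmoid activation is assumed.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def bptt_epoch(x, yd, y0, w, g, eta=0.1):
    T = len(x)
    y = np.zeros(T + 1); y[0] = y0
    for t in range(T):                     # forward pass: y(1) ... y(T)
        y[t + 1] = sigmoid(w * x[t] + g * y[t])
    e = yd - y[1:]                         # errors e(1) ... e(T)
    dw = dg = delta = 0.0
    for t in range(T, 0, -1):              # backward pass to the first node
        delta = y[t] * (1.0 - y[t]) * (delta * g + e[t - 1])
        dw += eta * delta * x[t - 1]       # Delta w_t = eta delta_t x(t-1)
        dg += eta * delta * y[t - 1]       # Delta g_t = eta delta_t y(t-1)
    return w + dw, g + dg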
Figure 1.22: A recurrent network that can learn the linear dynamics given
in (1.77).
Figure 1.23: The recurrent network of Figure 1.22 unfolded in time, with the weights w1, w2, w3 shared across time steps.
Figure 1.24: The input-output training data over 100 time steps.
The evolution of the weights during training is shown on the right of Figure 1.25.
Figure 1.25: System Identification using BPTT: Left figure shows the error
plot and right figure shows evolution of the weights
The BPTT algorithm is an off-line technique. In contrast, in RTRL the gradient information at t is forward-propagated to the next time step t + 1, so the algorithm can be implemented in real time.
w(t + 1) = w(t) − η ∂E(t + 1)/∂w ,
g(t + 1) = g(t) − η ∂E(t + 1)/∂g .

Differentiating E with respect to the synaptic weight w,

∂E(t + 1)/∂w = −[yd(t + 1) − y(t + 1)] ∂y(t + 1)/∂w .

Let us define two variables:

Pw(t + 1) = ∂y(t + 1)/∂w with Pw(0) = 0, and
Pg(t + 1) = ∂y(t + 1)/∂g with Pg(0) = 0.
Since y(t + 1) = 1/(1 + e^{−s(t+1)}), one can write

∂y(t + 1)/∂s(t + 1) = y(t + 1)(1 − y(t + 1)).

Similarly, ∂s(t + 1)/∂w can be computed as

∂s(t + 1)/∂w = ∂/∂w [w x(t) + g y(t)] = x(t) + g ∂y(t)/∂w = x(t) + g Pw(t).

Therefore, ∂y(t + 1)/∂w = y(t + 1)(1 − y(t + 1))[x(t) + g Pw(t)].

In a similar fashion one can compute

∂y(t + 1)/∂g = y(t + 1)(1 − y(t + 1))[y(t) + g Pg(t)],
which implies

∂E(t + 1)/∂w = −[yd(t + 1) − y(t + 1)] y(t + 1)(1 − y(t + 1))[x(t) + g Pw(t)],
∂E(t + 1)/∂g = −[yd(t + 1) − y(t + 1)] y(t + 1)(1 − y(t + 1))[y(t) + g Pg(t)],

where the values of Pw(t + 1) and Pg(t + 1) are computed using the recursions derived above.
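The resulting RTRL loop for this single-unit network is compact; a sketch follows, with the learning rate and initial weights as assumptions.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rtrl_train(x, yd, w=0.1, g=0.1, eta=0.2, y0=0.0):
    y, Pw, Pg = y0, 0.0, 0.0               # Pw(0) = Pg(0) = 0
    for t in range(len(x)):
        y_next = sigmoid(w * x[t] + g * y)
        d = y_next * (1.0 - y_next)
        Pw_next = d * (x[t] + g * Pw)      # forward-propagated gradient
        Pg_next = d * (y + g * Pg)
        e = yd[t] - y_next
        w += eta * e * Pw_next             # w(t+1) = w(t) - eta dE/dw
        g += eta * e * Pg_next
        y, Pw, Pg = y_next, Pw_next, Pg_next
    return w, g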
Figure 1.26: A recurrent network with external input x1(t), recurrent units 2, 3 and 4 (states x2, x3, x4), and output y(t + 1) = x4(t + 1).
Example 1.9. Derive the RTRL algorithm for the network shown in Figure
1.26.
Solution
Forward response of the network is computed as follows:
w_{jk}(t + 1) = w_{jk}(t) − η ∂E(t + 1)/∂w_{jk} ,

∂E(t + 1)/∂w_{jk} = −(yd(t + 1) − y(t + 1)) ∂y(t + 1)/∂w_{jk} .

Let us define ∂y(t + 1)/∂w_{jk} = ∂x4(t + 1)/∂w_{jk} = P^4_{jk}(t + 1). One can write

P^4_{jk}(t + 1) = f′(s4(t + 1)) [w42 P^2_{jk}(t) + w43 P^3_{jk}(t) + w44 P^4_{jk}(t) + δ_{j4} x_k(t)],
where f′(·) denotes the derivative of f with respect to its argument. Similarly one can write
P^3_{jk}(t + 1) = f′(s3(t + 1)) [w33 P^3_{jk}(t) + w34 P^4_{jk}(t) + δ_{j3} x_k(t)],
P^2_{jk}(t + 1) = f′(s2(t + 1)) [w22 P^2_{jk}(t) + w24 P^4_{jk}(t) + δ_{j2} x_k(t)].
In general, for any weight w_{pq},

P^4_{pq}(t + 1) = f′(s4(t + 1)) [w42 P^2_{pq}(t) + w43 P^3_{pq}(t) + w44 P^4_{pq}(t) + δ_{p4} x_q(t)].
For example, the update of w31 is

w31(t + 1) = w31(t) − η ∂E(t + 1)/∂w31 ,

∂E(t + 1)/∂w31 = −[yd(t + 1) − y(t + 1)] ∂y(t + 1)/∂w31
              = −[yd(t + 1) − y(t + 1)] P^4_{31}(t + 1).
Therefore the update law for w31 is computable, where

P^4_{31}(t + 1) = y(t + 1)[1 − y(t + 1)][w42 P^2_{31}(t) + w43 P^3_{31}(t) + w44 P^4_{31}(t)].
Similarly one can proceed for P^i_{31}(t), with the initial conditions P^i_{31}(0) = 0 for i = 2, 3, 4.
For the recurrent network of Figure 1.22, the weights are updated as

wi(t + 1) = wi(t) − η ∂E(t + 1)/∂wi for i = 1, 2, 3.
The partial derivatives of the cost function with respect to the weights are
computed as follows:
∂E(t + 1)/∂w1 = −(yd(t + 1) − y(t + 1)) Pw1(t + 1),

where

Pw1(t + 1) = ∂y(t + 1)/∂w1 = y(t) + w1 Pw1(t) + w2 Pw1(t − 1).

Similarly,

∂E(t + 1)/∂w2 = −(yd(t + 1) − y(t + 1)) Pw2(t + 1),
Pw2(t + 1) = ∂y(t + 1)/∂w2 = y(t − 1) + w1 Pw2(t) + w2 Pw2(t − 1),

∂E(t + 1)/∂w3 = −(yd(t + 1) − y(t + 1)) Pw3(t + 1),
Pw3(t + 1) = ∂y(t + 1)/∂w3 = u(t) + w1 Pw3(t) + w2 Pw3(t − 1).
100 input-output data points, as generated for the same example using BPTT,
have been used to train the network. The training data are shown in Figure
1.24. Using RTRL for one sequence of time, the final weights are found to
be

w1 = −0.5000, w2 = −0.9999 and w3 = 0.5010.

Figure 1.27(a) shows the RMS error versus the number of epochs, while Figure 1.27(b) shows the evolution of the weights during training.
Figure 1.27: System identification using RTRL: (a) error plot, with a final RMS error of 0.0009; (b) evolution of the weights.
Figure 1.28: A two dimensional self organizing feature map. (By updating
all the weight connecting to a neighborhood of the target neurons, it en-
ables the neighboring neuron to become more responsive to the same input
pattern. Consequently, the correlation between neighboring nodes can be
enhanced. Once such a correlation is established, the size of a neighbor-
hood can be decreased gradually, based on the desire of having a stronger
identity of individual nodes.)
• Competition: For each input pattern, the neurons in the network
compute their respective values of a discriminant function. The neu-
ron with the largest value of that function is declared winner.
Competitive Process
Let m be the dimension of the input (data) space and of the weight vectors. Let a randomly chosen input pattern (vector) be

x = [x1, x2, . . . , xm]⊤,

and let the weight vector of neuron j be

w_j = [w_{j1}, w_{j2}, . . . , w_{jm}]⊤, j = 1, 2, . . . , l,

where l is the number of neurons in the lattice.
Cooperative Process
The winner neuron tends to excite the neurons in its immediate neighbor-
hood more than those farther away from it. Let h_{j,i} denote the topological neighborhood centered on winning neuron i, and d_{i,j} denote the lateral distance between winning neuron i and excited neuron j.
A typical choice of h j,i that satisfies these requirements is the Gaussian
function as shown in Figure 1.29. The expression of a Gaussian neighbor-
hood function is given as
Figure 1.29: A Gaussian neighborhood function h_{j,i} versus the lateral distance d_{j,i}.

h_{j,i(x)}(n) = exp( −d²_{j,i} / (2σ²(n)) ), n = 0, 1, 2, . . . ,

where the neighborhood width shrinks with training as σ(n) = σ0 exp(−n/τ1), with

σ0 = initial value of σ,
τ1 = time constant of the width function,
n = number of the training step.
One-Dimensional Lattice
d_{j,i} = | j − i|.

Consider a one-dimensional Kohonen lattice of eight neurons. Let the index of the winning neuron i be 3 and the indices of two neighborhood neurons be 2 and 5. Thus,

d_{2,3} = |2 − 3| = 1,
d_{5,3} = |5 − 3| = 2.

[A one-dimensional lattice of neurons 1 to 8, with neuron 3 as the winning neuron.]
Two-Dimensional Lattice
For a two-dimensional lattice, d²_{j,i} is measured as

d²_{j,i} = ‖r_j − r_i‖²,

where r_j denotes the position of the excited neuron j in the lattice and r_i that of the winning neuron i.

[A two-dimensional lattice with a marked winning neuron.]
Adaptive Process
Weights associated with winning neuron and its neighbors are updated as
per a neighborhood index. Winning neuron is allowed to be maximally
benefited from this weight update while the neuron that is farthest from
the winner is minimally benefited. The change in weight in each training
step is given by

∆w_j = η(n) h_{j,i(x)}(n) (x − w_j).

This equation will move the weight of the winning neuron wi towards the input vector x.
A one-dimensional lattice of 100 neurons is trained with a two-dimensional input vector x. Thus,

m = 2,
x = [x1, x2]⊤,
w_j = [w_{j,1}, w_{j,2}]⊤; j = 1, 2, . . . , 100.
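The competitive, cooperative and adaptive steps for this one-dimensional lattice can be sketched as follows; the width and learning-rate schedule below are assumptions.

import numpy as np
rng = np.random.default_rng(0)

l, m = 100, 2
W = rng.uniform(-0.04, 0.04, size=(l, m))  # small random initial weights

sigma0, tau1, eta = 20.0, 1000.0, 0.1      # assumed decay schedule
for n in range(6000):
    x = rng.uniform(0.0, 1.0, size=m)      # sample from the input region
    i = np.argmin(np.linalg.norm(x - W, axis=1))  # competition: winner
    d = np.abs(np.arange(l) - i)                  # lattice distances d_{j,i}
    sigma = sigma0 * np.exp(-n / tau1)
    h = np.exp(-d ** 2 / (2.0 * sigma ** 2))      # cooperation
    W += eta * h[:, None] * (x - W)               # adaptation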
[Figures 1.30-1.34 (plots): the square and L-shaped input data spaces, the randomly initialized weights of the lattice, and the trained weights preserving the topology of the input space.]
The weights are initialized from a random set (−0.04 < w_{j,1} < 0.04 and −0.04 < w_{j,2} < 0.04). The input x is uniformly distributed in the L-shaped region (0 < x1 < 1 and 0 < x2 < 0.3) ∪ (0 < x1 < 0.3 and 0.3 <
x2 < 1). The training is done for 6000 iterations. Figure 1.33 shows the in-
put data space. Figure 1.31 shows the network weights before training and
Figure 1.34 shows that the weights of the network preserve the topology of
the input space.
Next, a two-dimensional lattice of 10 × 10 neurons is trained with a two-dimensional input vector x. Thus,
m = 2
x = [x1 , x2 ]T
wi, j = [wi, j,1 , wi, j,2 ]T ; i = 1, 2, ..., 10 j = 1, 2, ..., 10
The weights are initialized from a random set (−0.08 < wi, j,1 < 0.08 and −
0.08 < wi, j,2 < 0.08). The input x is uniformly distributed in the region
(0 < x1 < 1 and 0 < x2 < 1). Figure 1.30 shows the input data space. Figure
1.31 shows the network weights before training and Figure 1.35 shows that
the weights of the network preserve the topology of the input space. Next
we will show the topology preservation of an L shaped input space with the
same two-dimensional lattice. Figure 1.33 shows the input data space. Fig-
ure 1.31 shows the network weights before training and Figure 1.36 shows
that the weights of the network preserve the topology of the input space.
where u(·) ∈ Rm and x(·) ∈ Rn are input and state vectors respectively. When
the function f in (1.79) is unknown, the problem of system identification can
be formally stated as
Suppose that the dynamics of a causal, time invariant discrete time plant is
described by the equation (1.79), where the input u(·) is a uniformly bounded
signal. The plant is assumed to be stable. Then a feed-forward network N(·)
approximately represents the function f if N(·) predicts an output x̂(k + 1),
given the input sequence u(k) and previous system state x(k), such that ∑_{k=1}^{P} ‖x(k + 1) − x̂(k + 1)‖ ≤ ε for a small desired ε > 0, where P is the number of discrete-time instants.
Figure 1.37 shows learning framework for generating a neural system
model of (1.79). One can select either an MLN or an RBF network for neu-
ral representation. When the system is in input affine form, the governing
equation becomes

x(k + 1) = f(x(k)) + g(x(k)) u(k).
Figure 1.37: Learning framework for neural system identification: the network receives x(k) and u(k), predicts x̂(k + 1), and is trained with the error e(k + 1) = x(k + 1) − x̂(k + 1).

Figure 1.38: Series-parallel identification model for a system in input-affine form, with subnetworks f̂(x(k)) and ĝ(x(k)) producing x̂n(k + 1) = f̂(x(k)) + ĝ(x(k)) u(k) and the error en(k + 1).
where g has a form g(u) = 0.6 sin(u) + 0.3 sin(3π u) + 0.1 sin(5π u). The un-
forced linear system is asymptotically stable. Derive a neural model using
a series-parallel representation
The two subnetworks are taken as three-layered networks with hidden layers of 15 and 10 neurons respectively. The weights of the neural network N are ad-
justed using instantaneous back propagation algorithm with a learning rate
of 0.25. The input is randomly chosen from the uniform interval [−1, 1].
The identification is carried out for 500 time steps and the result is shown in
Figure 1.39. The RMS error between the output of the model and the plant
is found to be 0.003.
Figure 1.39: Response of the plant and the neural model after training
Solution
The identification is carried out for random inputs u1 and u2 which are uni-
formly distributed in the interval [−1, 1]. N 1 and N 2 correspond to two
three-layered networks with hidden units of 15 and 10 respectively. The
learning rate is taken as 0.1. Figure 1.40 shows the identification results.
Example 1.14. Lorentz Attractor Problem. The Lorentz attractor is generated by the system of differential equations ẋ = σ(y − x), ẏ = rx − y − xz, ż = xy − bz; model this chaotic system using a neural network.
Solution
Figure 1.40: Identification results for the two-output plant: (a) output 1 of the plant and the model; (b) output 2 of the plant and the model.
Figure 1.42: x − y projection of the Lorentz Attractor
for a specific problem. For our study the values of the parameters are taken
as σ = 16, r = 45.92 and b = 4. The solutions of this system of differential equations cannot be found by analytic integration; hence they are computed numerically, with a good amount of accuracy, using a fourth-order Runge-Kutta ODE solver.
The solution of these differential equations traces out a path in the three
dimensional phase space. The three dimensional plot of the Lorentz Attrac-
tor set solved by the Runge-Kutta method is shown in Figure 1.41. A plot of
the projection on the x − y plane is shown in Figure 1.42. Similarly a plot of
the Lorentz Attractor Set on the x-axis is also shown in Figure 1.43. It shows
the variation of the x-signal with time. All the three plots are obtained for
a set of 30000 data points, numerically calculated using the fourth-order Runge-Kutta solver.
The learning in this network is done in two stages. In the first
stage, only center learning is done. We start by placing some random cen-
ters in the clubbed input-output space. Then a clustering algorithm is used
to modify their positions such that the input-output vectors are equally clus-
tered among the centers of the neurons. After this, a projection of the
clubbed input-output vectors is taken on the input space, which provides
the network with the required centers. Then a gradient descent algorithm
is applied to learn the weight vectors.
For testing the performance of the Radial Basis Function Network, a 3-
input and 3-output neural network was used to first train and then predict
the Lorentz Attractor. This network consisted of 700 centers. The values of
σ (the neighborhood width) and η (the learning rate) were taken as 0.3 and
0.25 respectively. The training set consisted of 10, 000 data points which
Figure 1.43: Plot of the x-signal of the actual Lorentz Attractor with time
Figure 1.44: Plot of the x-signal coming from the trained RBFN.
were randomly given to the network. After training, the network was used to predict 30,000 data points of the Lorentz Attractor. The results of this are
shown in figures here. The Figure 1.44 denotes the x-signal with time. This
figure can be compared with Figure 1.43 to make a qualitative judgment
of the effectiveness of the network. The mean square error produced in x-
signal estimation is 0.0791083, for the y-signal is 0.062171 , for the z-signal
is 0.0588676, while the total error is 0.116514 . It is seen that this network
architecture has been successful in capturing the dynamics of the Lorentz
attractor to some extent, but certainly it does leave scope for improvement.
This network demonstrates the power of the general class of Radial Basis
Neural Networks. The RBFN performance can be improved if the clustering is done in a higher order manner, i.e., on the clubbed input-output vectors, as described below. The learning is again done in two stages.
Figure 1.45: 3-D plot of the predicted Lorentz System using RBFN with higher order clustering.
Figure 1.46: x-signal plot of the predicted Lorentz System using RBFN with
higher order clustering.
In the first stage, only center learning is done. We start by placing some random
centers in the clubbed input-output space. Then a clustering algorithm is
used to modify their positions such that the input-output vectors are equally
clustered among the centers of the neurons. After this, a projection of the clubbed input-output vectors is taken on the input space, which provides
the network with the required centers. Then a gradient descent algorithm
is applied to learn the weight vectors. The number of centers in the network
was 500 and the values of σ (the neighborhood width) and η (the learning
rate) were taken as 1.5 and 0.30 respectively. Here also a training set of
10,000 data points was given to the network randomly. After training, the network was used to predict 30,000 data points of the Lorentz Attractor
given the same initial points. The results of this are shown in the plots. Fig-
ure 1.45 shows the three dimensional plot of the predicted Lorentz attractor
after the training was complete. Similarly Figure 1.46 shows the behavior
of the x-signal obtained from the RBFN using higher order clustering. The
mean square errors (MSE) for this network were as follows: MSE in x-signal
= 0.0632505; MSE in y-signal = 0.0818714; MSE in z-signal = 0.043263;
while the total MSE = 0.0112162.
point (u_inp, u_op) from this set of points and update A_ij in the following manner:

A_ij^{new} = A_ij^{old} + η(t) g(u_inp, w_ij) ∆A_ij , (1.82)

where

∆A_ij = ‖∆v‖⁻² (I − A_ij) ∆v ∆v⊤,
∆v = u_op − θ_ij ,
η = learning rate,
g(u_inp, w_ij) = exp( −‖u_inp − w_ij‖² / (2σ²) ),
I = 3 × 3 identity matrix.
Therefore, for any given input u the corresponding output ui j from each
neuron is given by
u_ij = θ_ij + A_ij (u − w_ij). (1.83)
We find the final output of the whole network by taking a weighted average over the outputs from the various neurons, where the weight depends upon the distance of each neuron from the "winner" neuron (corresponding to this input). Thus, we get:

u = ∑_{i,j} h(i, i0, t) u_ij / ∑_{i,j} h(i, i0, t), (1.84)
where

h(i, i0, t) = exp( −‖(i, j) − (i0, j0)‖² / (2σ(t)²) ).

Here, (i0, j0) represents the "winner" neuron for the input u. In this particular problem σ(t) and η(t) are both high initially and are then decreased to a very low positive value by equation (1.85):

ε(t) = ε_initial (ε_final / ε_initial)^{t/t_max} , (1.85)

where ε ∈ {σ, η}.
The use of the Hybrid structure of the Kohonen SOM as function approxi-
mation technique is described in Section 1.9. We will use this approach to
model the Lorentz Attractor as given in Example 1.14. A two-dimensional
Figure 1.47: Plot of the predicted Lorentz Attractor on the X-Y plane using
2-dimensional Kohonen Lattice.
Figure 1.48: Plot of the x-signal of the predicted Lorentz Attractor using
2-dimensional Kohonen Lattice.
network is considered for the purpose, which consists of a 70 × 70 lattice; the values of σ_initial, σ_final, η_initial and η_final are taken as 40, 0.01, 0.90 and 0.01 respectively. The network is trained over a training set of 10000 data
points. Then this network is used to predict the Lorentz Attractor for 30000
data points. The results obtained for this network are shown in the next set
of figures and plots. From these plots the topological preserving nature of
the Kohonen Lattice is very well seen, in spite of the fact that this learning
was done in the higher dimensional space after clubbing the input-output
vectors. The next plot, Figure 1.49 shows the positioning of the weight vec-
tors of the neurons of the two dimensional Kohonen Lattice in the input
space after learning. The topology preservation is very clearly seen which
indicates that the network has efficiently learnt the data and has effectively
captured the dynamics of the time series.
Figure 1.49: Plot of the weight vectors in the input space after learning.
Figure 1.50: Plot of the predicted Lorentz Attractor on the X-Y plane using
a 3-dimensional Kohonen Lattice.
This network is very similar to that of the network described above in Sec-
tion 1.9. The only difference between this network and the one above is in
the dimension of the Kohonen Lattice. In this case a 3-dimensional lattice is
used instead of a 2-dimensional one. The output vectors are also learned by
the use of a Higher Order Clustering Algorithm, i.e., the input and output
vectors are clubbed and then the Kohonen Learning Rule is applied to these
vectors as described earlier. The network consists of a 15 × 12 × 12 lattice and the values of σ_initial, σ_final, η_initial and η_final are taken as 20, 0.01, 0.90
and 0.005 respectively. Once the training is over, the projection of these
vectors on the input space become the weight vectors of the neurons, while
the projected vectors on the output space form the output of the respective
neurons. The network is trained over a training set of 10000 data points.
Then this network is used to predict the Lorentz Attractor for 30000 data
points. It is intuitively felt that a 3-dimensional Kohonen Lattice should
perform better in predicting and capturing the dynamics of the Lorentz At-
tractor. This is indeed the case as demonstrated by the results obtained.
Figure 1.51: Plot of the predicted Lorentz Attractor on the X plane using a
3-dimensional Kohonen Lattice.
Figure 1.52: Plot of the weight vectors in the X-Y plane using 3-dimensional
Kohonen Lattice
The plots of the obtained results are shown here in Figures 1.50, 1.51 and
1.52. Readers should be able to appreciate the use of SOM network in sys-
tem identification as these results show that even chaotic attractors can be
modelled using such networks.
y = f (x), x ∈ Rn , y ∈ R m . (1.86)
Figure 1.53: KSOM network for system identification.
The nonlinear map (1.86) can be approximated locally using a first order Taylor series expansion. Given any input vector x0,
y0 = f (x0 ). (1.87)
Using the first order Taylor series expansion, the output y can be expressed linearly around x0 as follows:

y = y0 + (∂f/∂x)|_{x=x0} (x − x0). (1.88)
Let us consider the following Kohonen lattice, where each neuron γ is associated with the linear model

y_γ = ȳ_γ + A_γ (x − w_γ), (1.89)

where ȳ_γ is the output value stored at neuron γ,
and h_γ is the neighborhood function with respect to the winning neuron. Thus the nonlinear map y = f(x) can be approximated as

y = ∑_γ h_γ y_γ / ∑_γ h_γ , (1.90)

where h_γ = e^{−d_γ²/(2σ²)} and d_γ is the lattice distance between the winning neuron and neuron γ.
The final expression for the network response can be given as

y = ∑_γ h_γ [ȳ_γ + A_γ (x − w_γ)] / ∑_γ h_γ . (1.91)
As shown in Figure 1.54, the network has a collective response y when ex-
cited by the input pattern x where each neuron computes its own response
linearly. The readers must note that the parameters associated with each neuron, w_γ, A_γ, and ȳ_γ, are unknown and are randomly initialized with very small
values. We will now derive the update laws for these parameters. Given that
wγ is the natural weight vector, its update will follow the same Kohonen
weight update algorithm:
wγ = wγ + η hγ (x − wγ ). (1.92)
Let the cost function be E = ½ ỹ⊤ỹ, where ỹ = yd − y, yd is the desired response given x, and y is the network response. The update law for ȳ_γ can be derived using gradient descent:

∂E/∂ȳ_γ = −ỹ⊤ ∂y/∂ȳ_γ = −ỹ⊤ h_γ / ∑_γ h_γ . (1.93)
For the update law of A_γ, the gradient term is derived as:

∂E/∂A_γ = −ỹ⊤ ∂y/∂A_γ = −(h_γ / ∑_γ h_γ) ỹ (x − w_γ)⊤. (1.95)
Figure 1.54: (a) Error convergence during the training; (b) y1 versus yd1 dur-
ing the testing; (c) y2 versus yd2 during the testing.
1.10 Summary
This chapter on neural networks is self-sufficient for the readers to grasp
the intelligent control concepts using neural networks covered in this book.
Network architectures and associated learning algorithms for multi-layered
networks, radial basis function networks, recurrent networks and SOM net-
works have been presented in a tutorial manner in this chapter. The use
of these networks for function approximations has been described through
many illustrative examples. The section on system identification allows
readers to understand the process of function approximation for nonlinear
systems including a chaotic system. The section on adaptive learning rate
shows that control theoretic concepts can be applied to network learning.
This section shows that both control theory and neural learning research
complement each other.
1.11 Exercises
1. Find the global minima of the following:
Set up two sets of data, one for training and another for testing. Use
the training data set to compute the synaptic weights of the network.
Evaluate the computation accuracy of the network by using the test
data. Use a single hidden layer but variable number of hidden neu-
rons. Note the effect of the change in size of the hidden layer on the
network performance.
6. The MLN in the figure 1.56, uses a single hidden layer with sigmoidal
activation function.
Figure 1.56: An MLN with inputs x1, x2, x3, x4 and hidden-layer weights w21, w22, w23, w24.
(a) Find out the response of the network using forward pass.
(b) Find out the update law for the weights w21 and w24 .
7. Construct an RBFN with four hidden units, with each radial-basis
function center being determined by each piece of input data, to ob-
tain an exact solution to the XOR problem. The four possible input
patterns are defined by (0,0), (0,1), (1,1), and (1,0), which represent the cyclically ordered corners of a square.
(a) Construct the interpolation matrix for the resulting RBFN. Hence,
compute the inverse matrix Φ−1 .
(b) Calculate the linear weights of the output layer of the network.
x1 = [−7 +2]T , y1 = +1
x2 = [−4 −2]T , y2 = −1
9. For an RBFN, find the update rule for the centers using gradient de-
scent technique while the basis function is assumed to be inverse
quadratic, i.e.,
Φ(r) = 1 / (r² + c²)^{1/2}
10. Realize an XOR function using a radial basis function network whose basis function is given as Ψ(r) = r² log(r).
11. An RBFN has 5 radial centers. It has one input u and one output y.
The weight vector is denoted by w. The center of each unit is denoted
as ci . The radial basis function is Gaussian.
y(k + 1) = ( y(k) / (1 + y²(k)) ) u³(k)
The input sequence is given as {0.5, 0.2, 0.05, 0.8, 0.35}. The radial
function is Gaussian. Draw the architecture of the RBFN network
with 5 radial centers. What are the radial centers? Find the weights using the pseudo-inverse technique. Find the weights by the recursive gradient algorithm. Compare the results.
13. Train an RBFN to learn the quadratic function y = u². The network has 50 centers, randomly chosen between 0 and 1. The input range is [0, 1]. The radial function used is the Gaussian φ(r) = e^{−r²}.
(a) Find the response of the network when the input takes a value from the set [0.2, 0.5, 0.6, 0.9]. Does the result indicate that the network has learned the above mapping?
(b) One way to check for proper training is to test both the forward mapping (response to a given input) and the inverse mapping (predicting the input for a given response). For inverse mapping, the iterative algorithm is given as
u(t + 1) = u(t) − η ∂E(t)/∂u(t).
i. Find the explicit expression for ∂ E/∂ u the network given.
ii. Using the iterative inversion algorithm predict input when
the desired response takes a value from the set [0.09, 0.25, 0.36, 0.64].
Select the initial value randomly from [0, 1].
iii. Repeat above steps while updating both centers and weights.
15. Generate a set of random inputs u(k), k = 1, 2, ..., 200. Obtain the
training data set defined by the relation:
y(k + 1) = ( 1 / (1 + y²(k)) ) · u(k).
Figure 1.58: A recurrent network with input u(k), weight w, and output y(k + 1).

Figure 1.59: A recurrent network with input u(t), weights w1, w2, w3, w4, and output y(t + 1).
ml² θ̈ + mgl cos θ = τ
• Multi-layered network
• Radial Basis Function Network
Generate 1000 pairs of new test data and verify the identification re-
sult.
18. The state space model of a magnetic ball suspension is given by
dx1(t)/dt = x2(t),
dx2(t)/dt = g − x3²(t) / (M x1(t)),
dx3(t)/dt = −(R/L) x3(t) + (1/L) v(t),
where x1 (t) = y(t) is the ball position in meters, x2 (t) = ẏ(t) is the
ball velocity, x3 (t) = i(t) is the winding current and v(t) is the input
voltage. The system parameters are M = 0.1 kg, the mass of the ball,
g = 9.81 m/s2 , the gravitational acceleration, R = 50 Ω, the resistance
of the winding, and L = 0.5 H, the winding inductance. The position of the ball is measured by a position sensor. Express the system in
a discrete dynamical form and identify the dynamics of the system
using
• Multi-layered network
• Radial Basis Function Network
You should generate separate data sets for identification and model
validation. Compare the results of identification.
θ̇ = ω
J ω̇ = −Km ia sin(N θ ) + Km ib cos(N θ ) − Bω − TL
La i̇a = −Ra ia + Kb ω sin(N θ ) + va
Lb i̇b = −Rb ib − Kb ω cos(N θ ) + vb
J = 0.0733 kg·m², B = 0.002 N·m·s/rad,
m = 0.4014 kg, N = 50, l = 1 m,
La = Lb = 0.7 H, Ra = 0.9 Ω, Rb = 1.2 Ω,
Km = 0.25 N·m/A, Kb = 0.025 V·s/rad, g = 9.81 m/s².
Find out the discrete time representation of the system and identify
the dynamics using a radial basis function network.
where u(k) is the input flow, h(k) is the liquid level of the tank (the output of the system), and A(h(k)) = √(a h²(k) + b) is the cross-sectional area of the tank at the kth instant. The input u(k) can be both positive and negative, i.e., it can pull liquid out of the tank as well as put it into the tank. The parameters a and b are given as a = 1 and b = 2. Considering T = 0.1, identify the system dynamics using a recurrent network. Employ both BPTT and RTRL for learning the network.