
Chapter 1

Neural Networks

Many control applications need a mathematical model of the plant. Deriving one is considered a hard task: it requires the effort and time of process and control engineers. One way to overcome this problem is to use data-driven methods. Such methods fit the collected experimental data to a system model, categorizing it into a specific class. The derived model can then be used to design the required controller for the system. Though this is a difficult task, a good fit for the model is one which includes only those dynamics of the system that are of interest for the control specifications. There are also data-driven methods which allow the design of a controller belonging to a certain class without the need for an identified model of the system.
Consider the task of classifying the elements of a given set into two groups based on a prescribed rule. Figure 1.1 depicts two instances of such a problem. One needs to identify the trajectory that separates the plane into regions. Such a trajectory is called a separatrix; in other words, it is the boundary separating two modes of behaviour in a differential equation. On the left, the task is to identify the separatrix between two linearly separable classes (i.e. the separatrix is a simple straight line), while on the right the two classes are nonlinearly separable (i.e. the separatrix is not a simple straight line).

1.1 Linear Separatrix


Based on the data, these modes can be developed into model. Hence, linear
separatrix can be devised to have linear model from the given data. The
data, therefore, can be represented in terms of following expression:

Aw = y; A ∈ Rn×m , w ∈ Rm , and y ∈ Rn , (1.1)

where A is the system/model matrix, w is the weight vector associated with the model, and y is the data or required solution from the model.

Figure 1.1: Examples of linear and nonlinear modelling problems.

There are three possibilities for the solution of (1.1):

i. n = m: an exact solution exists.

ii. n > m: no exact solution exists; a least squares solution exists.

iii. n < m: many exact solutions exist.

For the cases above, the approaches to find the solution are:

i. Exact solution: w = A^{-1} y.

ii. No exact solution: least squares solution.

iii. Many exact solutions: minimum norm solution.

The least squares and minimum norm solutions are discussed below.

1.1.1 Least Squares Solution


For the expression defined in (1.1), a least square solution is derived using
error cost function E as
1
E = ỹ⊤ ỹ, (1.2)
2
where ỹ = Aw − y. The solution of parameter w can be obtained from

δE
= 0. (1.3)
δw

2
On using (1.2) in (1.3) gives

(Aw − y)⊤ A = 0 (1.4)

which on further simplification results in

w = (A⊤ A)−1 A⊤ y. (1.5)

Example 1.1. The velocities of the left and right wheels of a mobile robot are given in the table below.

Figure 1.2: A mobile robot.

Sonar s   Left Wheel Velocity vl   Right Wheel Velocity vr
60 cm     10 cm/sec                10 cm/sec
50 cm     9 cm/sec                 9 cm/sec
30 cm     7 cm/sec                 4 cm/sec
20 cm     4 cm/sec                 0 cm/sec

The two models for either of the wheel velocities are written as

v1 = w1 s + w0   and   v2 = w2 s² + w1 s + w0.   (1.6)

Find the weights w of the wheel velocity models v1 ([w1 w0]⊤) and v2 ([w2 w1 w0]⊤), respectively.
Solution. The parametric models (v1 and v2) of the left wheel are written as

0.6w1 + w0 = 10        0.36w2 + 0.6w1 + w0 = 10
0.5w1 + w0 = 9         0.25w2 + 0.5w1 + w0 = 9
0.3w1 + w0 = 7         0.09w2 + 0.3w1 + w0 = 7
0.2w1 + w0 = 4         0.04w2 + 0.2w1 + w0 = 4.

Evaluating the w parameters for both models using (1.5) gives

Model 1: w = [14 1.9]⊤   and   Model 2: w = [−33.3333 40.6667 −2.6000]⊤.

The fitted values ŷ obtained using (1.1) for the two models (both fitted to the left-wheel data) are

ŷ1 = [10.3 8.9 6.1 4.7]⊤   and   ŷ2 = [9.8 9.4 6.6 4.2]⊤.

The errors of the two models computed using (1.2) are

E1 = 0.7   and   E2 = 0.2.   (1.7)
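As a quick check, the least squares solution (1.5) can be evaluated numerically. The following is a minimal NumPy sketch (not part of the original example) that reproduces the weights, fitted values and errors above from the left-wheel data:

import numpy as np

s = np.array([0.6, 0.5, 0.3, 0.2])   # sonar readings in metres
y = np.array([10.0, 9.0, 7.0, 4.0])  # left wheel velocities in cm/sec

# Model 1: v = w1*s + w0          ->  A = [s  1]
# Model 2: v = w2*s^2 + w1*s + w0 ->  A = [s^2  s  1]
A1 = np.column_stack([s, np.ones_like(s)])
A2 = np.column_stack([s**2, s, np.ones_like(s)])

for A in (A1, A2):
    w = np.linalg.solve(A.T @ A, A.T @ y)  # eq. (1.5)
    y_hat = A @ w                          # fitted velocities
    E = 0.5 * np.sum((y_hat - y)**2)       # eq. (1.2)
    print(w, y_hat, E)
# prints w = [14, 1.9] with E1 = 0.7, and w = [-33.33, 40.67, -2.6] with E2 = 0.2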

1.1.2 Minimum Norm Solution


The data representation below is same as (1.1), i.e.,

Aw = y; A ∈ Rn×m , w ∈ Rm , and y ∈ Rn . (1.8)

If n < m, i.e., there are more parameters than equations, then the solution is obtained as

Minimize ∥w∥² = w⊤w
subject to Aw = y.   (1.9)

The Lagrangian cost function for (1.9) is defined as

E = (1/2) w⊤w + λ(Aw − y).   (1.10)

The parameters λ and w are obtained as follows.

• λ:
∂E/∂λ = 0.   (1.11)

Using (1.10) in (1.11) gives Aw = y, which, on substituting the expression for w from (1.14) below, gives

−AA⊤λ⊤ = y.   (1.12)

• w:
∂E/∂w = 0.   (1.13)

Using (1.10) in (1.13) gives

w⊤ + λA = 0  ⇒  w = −A⊤λ⊤,   (1.14)

which on further using (1.12) gives

w = A⊤(AA⊤)^{-1} y.   (1.15)
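A minimal NumPy sketch of the minimum norm solution (1.15); the matrix below is an arbitrary underdetermined example for illustration, not data from the text:

import numpy as np

# Underdetermined system: n = 2 equations, m = 4 unknowns (n < m)
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
y = np.array([1.0, 2.0])

# Minimum norm solution, eq. (1.15): w = A^T (A A^T)^{-1} y
w = A.T @ np.linalg.solve(A @ A.T, y)
print(w, A @ w)   # A @ w reproduces y exactly
# np.linalg.pinv(A) @ y returns the same minimum norm solution.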

Example 1.2. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features are measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher [] developed a linear discriminant model to distinguish the species from each other.

Figure 1.3: The first three figures are three different species of Iris flower and the fourth one shows the 'petal' and 'sepal' of the flower.

Solution. The data can be represented via a linear model using the following expression:

Aw = y.   (1.16)

Figure 1.4: The three different species based on UCI Iris data.

For this we need to load data from the required dataset and perform some basic steps to obtain the model parameters and associated weights. The code snippets in Python are given below.
i Loading data in Python:

code
from sklearn.datasets import load_iris
import numpy as np
import keras
np.random.seed(10)
solution
Using TensorFlow backend

ii Input feature and output label representation: After the iris data is imported in Python, the returned object is a dict; its 'data' entry holds the input features and its 'target' entry holds the output labels.

code
iris = load_iris()
print('keys in iris dictionary: ', iris.keys())
X = iris['data']
print('First 3 entries of X:', X[:3])
Y = iris['target']
print('First 3 entries of Y:', Y[:3])
names = iris['target_names']
print('names:', names)
feature_names = iris['feature_names']
print('feature_names:', feature_names)
# Track a few sample points
isamples = np.random.randint(len(Y), size=(5))
print(isamples)
solution
keys in iris dictionary: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
First 3 entries of X: [[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]]
First 3 entries of Y: [0 0 0]
names: ['setosa' 'versicolor' 'virginica']
feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[  9 125  15  64 113]

iii Shape of data: Inspect the shapes of X and Y and the tracked sample points.

code
print('Shape of X:', X.shape)
print('Shape of Y:', Y.shape)
print('X - samples: ', X[isamples])
print('Y - samples: ', Y[isamples])
solution
Shape of X: (150, 4)
Shape of Y: (150,)
X - samples: [[4.9 3.1 1.5 0.1]
 [7.2 3.2 6.0 1.8]
 [5.7 4.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [5.7 2.5 5.0 2.0]]
Y - samples: [0 2 0 1 2]

iv Convert labels to unique categories: The labels are converted into one-hot encoded categories.

code
from keras.utils import to_categorical
# Ny is number of categories/classes
Ny = len(np.unique(Y))
print('Ny:', Ny)
# Convert to 1-hot encoding
Y = to_categorical(Y[:], num_classes=Ny)
print('X - samples:', X[isamples])
print('Y - samples:', Y[isamples])
solution
Ny: 3
X - samples: [[4.9 3.1 1.5 0.1]
 [7.2 3.2 6.0 1.8]
 [5.7 4.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [5.7 2.5 5.0 2.0]]
Y - samples: [[1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

v Training - Testing splitting of data (randomly into 80%-20%):

code
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
    test_size=0.20, random_state=1)
print('X_train shape:', X_train.shape)
print('X_test.shape:', X_test.shape)
solution
X_train shape: (120, 4)
X_test.shape: (30, 4)

vi Scaling the data (X): To use the data appropriately it is scaled to zero mean and unit variance.

code
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Computes the mean and standard deviation
scaler.fit(X_train)
# Perform transformation: x = (x - mean)/std
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# first 5 samples of X_train
print('X_train: ', X_train[:5])
print('Y_train:', Y_train[:5])
solution
X_train: [[ 0.31553662 -0.04578885  0.44767531  0.23380268]
 [ 2.2449325  -0.04578885  1.29769171  1.39742892]
 [-0.2873996  -1.24028061  0.05100098 -0.15407273]
 [ 0.67729835 -0.52358555  1.01435291  1.13884531]
 [-0.04622511 -0.52358555  0.73101411  1.52672073]]
Y_train: [[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

The feature parameters are ls, bs, lp, and bp (sepal length, sepal width, petal length, and petal width). The values of these parameters and the encodings y0, y1, and y2 are obtained from the solution above.

ls      bs      lp      bp      y0  y1  y2
0.31    -0.04   0.45    0.23    0   1   0
2.24    -0.04   1.30    1.40    0   0   1
-0.29   -1.24   0.05    -0.15   0   1   0
0.68    -0.52   1.01    1.14    0   0   1
-0.05   -0.52   0.73    1.53    0   0   1

Consider the following equation:

\begin{bmatrix} l_s & b_s & l_p & b_p & 1 \end{bmatrix}
\begin{bmatrix} w_{00} & w_{01} & w_{02} \\ w_{10} & w_{11} & w_{12} \\ \vdots & \vdots & \vdots \\ w_{40} & w_{41} & w_{42} \end{bmatrix}
= \begin{bmatrix} y_0 & y_1 & y_2 \end{bmatrix}.   (1.17)

To find the solution, i.e., the associated weights, (1.17) is written in the form

AW = Y,   (1.18)

where A = [ls  bs  lp  bp  1] and Y = [y0  y1  y2]. Solving (1.18) using the least squares solution

W = (A⊤A)^{-1} A⊤Y   (1.19)

gives

W = \begin{bmatrix} 0 & -0.4543 & 0.0665 \\ 0 & -0.3726 & -0.1308 \\ 0 & 2.0741 & -0.0692 \\ 0 & -1.2851 & 0.6930 \end{bmatrix}.
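Continuing the code snippets above, the weights in (1.19) can be computed directly from the scaled training data. A minimal sketch follows; the accuracy check at the end is an added illustration, not part of the original solution:

import numpy as np

# Append a bias column of ones to the scaled features: A W = Y, eq. (1.18)
A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])   # (120, 5)
# Least squares solution, eq. (1.19)
W = np.linalg.solve(A.T @ A, A.T @ Y_train)                # (5, 3)
print(W)

# Predicted class = column with the largest response
A_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
pred = np.argmax(A_test @ W, axis=1)
true = np.argmax(Y_test, axis=1)
print('test accuracy:', np.mean(pred == true))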

The above section discusses solutions through linear modelling approaches. However, these will not work if nonlinear modelling is required. A very effective tool for nonlinear modelling is the artificial neural network (ANN). ANNs use learning algorithms that can independently make adjustments, or learn in a sense, as they receive new input.

1.2 Artificial Neural Networks


Artificial neural networks (ANNs) have played a major role in the development of intelligent control schemes. These networks are used extensively for system identification as well as for controller parameterization. In this chapter, multi-layered networks, radial basis function networks, recurrent networks and self-organizing map networks will be discussed from a function approximation perspective. Multi-layered networks and radial basis function networks have feed-forward connection structures. When these networks are used as controllers or as system models, it becomes relatively easy to perform closed-loop analysis of a control system in terms of stability and convergence. Thus the extensive use of feed-forward networks in intelligent control system development in the existing literature is not surprising.

1.2.1 Feed-Forward Networks


The simplest form of an artificial neural network is called a perceptron. The single layer perceptron model was first developed by Rosenblatt in 1958 [rosen:58]. The basic model consists of a single neuron with adjustable synaptic weights and bias. However, in 1969, Marvin Minsky and Seymour Papert wrote a book called "Perceptrons" [minsky:69] in which they proved mathematically that single-layer perceptrons can only classify linearly separable patterns. This book marked a turning point in the field of artificial neural networks by showing the limitations of single-layer perceptrons. Though the idea of multi-layer perceptrons existed, nobody knew how to train them. As a result, researchers almost lost interest in neural networks until about the mid-eighties, when the back-propagation algorithm for training multi-layer perceptrons was discovered. The development of the back-propagation algorithm, along with its use in machine learning, was reported by Rumelhart, Hinton and Williams [rumel:86] in 1986. Werbos [werbos:74] and Parker [parker:85] had also independently proposed the back-propagation algorithm, in 1974 and 1985 respectively. In 1988, Broomhead and Lowe [broom:88] described the design of layered feed-forward networks using radial basis functions, which came as an alternative to multi-layer perceptrons. Basically, a feed-forward network consists of multiple layers of neurons where the neurons in one layer are forward-connected to the neurons of the next layer. Although there are many variants of feed-forward networks, multi-layered networks and radial basis function networks have been used in many applications. These two networks can act as universal approximators, i.e. any nonlinear function can be approximated to an arbitrary accuracy using a properly tuned multi-layered network or radial basis function network.

1.3 Multi-Layered Neural Networks


A multi-layered neural network (MNN) as shown in Fig. 1.5 usually con-
sists of an input layer of a set of sensory units or source nodes, L − 1 hidden
layers of neurons and an output layer of neurons. The input signal propa-
gates through the network on a layer-to-layer basis in the forward direction.

Each neuron generates its response using a sigmoidal activation function. As mentioned earlier, MNNs have been successfully applied to some difficult problems by training them in a supervised manner with the error back-propagation algorithm. The back-propagation algorithm uses the principle of gradient descent to train the network parameters.

Figure 1.5: An L-layered network.

A synopsis of the notation used to represent an MNN with L layers is as follows:


x : m × 1 input vector
y : n × 1 output vector
ik ; k = 1, · · · , L : index for representing a neuron in kth layer
i0 : index for representing a neuron in input layer
hik : Weighted sum of input stimuli to ith th
k neuron in k layer
th
vik : response of ik neuron in k layerth

Wi1 i0 : Weight connecting ith th


1 unit of 1st layer and i0 unit of input layer
Wi2 i1 : Weight connecting ith th
2 neuron of 2nd layer and i1 unit of 1st layer
k neuron of k layer and ik−1 neuron of k − 1
th
Wik ik−1 : Weight connecting ith th th

layer

1.3.1 Principle of Gradient Descent


Suppose that a function y = f(x) has to be approximated by a neural network. If the number of training patterns available is N, then the cost function to be minimized to learn this mapping is defined as follows:

E = ∑_{p=1}^{N} E_p,   (1.20)

where E_p = (1/2)(y_p^d − y_p)². Here y_p and y_p^d denote the p-th actual and desired patterns. Since y_p is a function of the network weight vector w, E can also be
expressed as a function of w. Consider a typical relationship between the cost function E and the weight vector w, as shown in Figure 1.6, where the cost function has only one global minimum.

Figure 1.6: Weight update rule in the gradient descent algorithm.

The principle of gradient descent says that to minimize the cost function E, the weight vector w should be updated using the following rule:

w_new = w_old − η (∂E/∂w)|_{w=w_old},

where η is the learning rate. Figure 1.6 shows that E attains its minimum value at w = w_min. One should notice that when w is less than w_min, the slope ∂E/∂w is negative; thus the change in w is positive, which moves w towards w_min. Similarly, when w is greater than w_min, the change in w is negative, which makes w move to the left, i.e., towards w_min. In both cases, w tends to w_min in a number of steps depending on the learning rate η.

Batch Update: When the weight vector w is updated such that ∆w = w_new − w_old is a function of the overall cost function E, i.e., ∆w = −η ∂E/∂w, the update rule is termed a batch update, which is an offline technique of weight update.

Instantaneous Update: When the weight vector w is updated such that ∆w = w_new − w_old is a function of the instantaneous cost function E_p, i.e., ∆w = −η ∂E_p/∂w, the update rule is termed an instantaneous update.

1.3.2 Derivation of Back Propagation Algorithm
The back propagation (BP) algorithm [Lipp:87, naren:91] offers an effective approach to the computation of the gradients. It can be applied to any optimization formulation, i.e., any type of energy function, in either a maximization or a minimization problem.
Let us consider a two-layered network as shown in Figure 1.7, where i2, i1 and i0 refer to the output, hidden, and input layers, respectively. The objective of this algorithm is to adjust the weights W_{i_2 i_1} and W_{i_1 i_0} in order to minimize a cost function E, which trains the network to map the required function.

Figure 1.7: A two-layered network.

Forward Phase

As shown in Figure 1.7, the input to the i_1-th neuron of the hidden layer is given by

h_{i_1} = ∑_{i_0=1}^{m} W_{i_1 i_0} x_{i_0},   (1.21)

where m is the dimension of the input vector x. The output of the i_1-th neuron of the hidden layer is given as

v_{i_1} = Ψ(h_{i_1}) = 1/(1 + e^{−h_{i_1}}),   (1.22)

where Ψ(·) is the sigmoidal activation function. The final response (actual output) of the i_2-th neuron in the output layer is given as

y_{i_2} = v_{i_2} = Ψ(h_{i_2}) = 1/(1 + e^{−h_{i_2}}),   (1.23)

where

h_{i_2} = ∑_{i_1=1}^{n_1} W_{i_2 i_1} v_{i_1}.   (1.24)

Back-Propagation of Error

The weights of a multi-layered network can be updated using either batch mode or instantaneous mode. However, for most applications, including control applications that need real-time implementation, it is beneficial to perform the instantaneous update. The instantaneous back-propagation algorithm will be derived in this section.
The instantaneous cost function in terms of the error can be expressed as

E(t) = ∑_{i_2=1}^{n_2} E_{i_2}(t) = (1/2) ∑_{i_2=1}^{n_2} (y_{i_2}^d(t) − y_{i_2}(t))²,   (1.25)

where

E_{i_2}(t) = (1/2)(y_{i_2}^d(t) − y_{i_2}(t))²   (1.26)

and y_{i_2}^d is the desired response of the i_2-th unit of the output layer.

Update of the weights connecting the output layer and the hidden layer

The update of the weights using the gradient descent algorithm is

W_{i_2 i_1}(t+1) = W_{i_2 i_1}(t) − η ∂E(t)/∂W_{i_2 i_1}(t),   (1.27)

where η is the learning rate. The derivative ∂E/∂W_{i_2 i_1} is computed as follows:

∂E(t)/∂W_{i_2 i_1}(t) = ∑_{i_2=1}^{n_2} (∂E_{i_2}(t)/∂y_{i_2}) × (∂y_{i_2}/∂W_{i_2 i_1}(t)),   (1.28)

where the first term is computed using equation (1.26) as

∂E_{i_2}(t)/∂y_{i_2} = −(1/2) × 2(y_{i_2}^d − y_{i_2}) = −(y_{i_2}^d − y_{i_2})   (1.29)

and the second term is computed as

∂y_{i_2}/∂W_{i_2 i_1} = (∂y_{i_2}/∂h_{i_2}) × (∂h_{i_2}/∂W_{i_2 i_1}).   (1.30)

The following relations are obtained using equations (1.23) and (1.24), respectively:

∂y_{i_2}/∂h_{i_2} = ∂/∂h_{i_2} (1/(1 + e^{−h_{i_2}})) = y_{i_2}(1 − y_{i_2})   and   (1.31)

∂h_{i_2}/∂W_{i_2 i_1} = v_{i_1}.   (1.32)

The gradient (1.28) is obtained using equations (1.29), (1.31), and (1.32) as

∂E(t)/∂W_{i_2 i_1}(t) = −(y_{i_2}^d − y_{i_2}) y_{i_2}(1 − y_{i_2}) v_{i_1}.   (1.33)

The update law turns out to be

W_{i_2 i_1}(t+1) = W_{i_2 i_1}(t) + η (y_{i_2}^d − y_{i_2}) y_{i_2}(1 − y_{i_2}) v_{i_1}
              = W_{i_2 i_1}(t) + η δ_{i_2} v_{i_1},   (1.34)

where δ_{i_2} = y_{i_2}(1 − y_{i_2})(y_{i_2}^d − y_{i_2}) is the error back-propagated from the output layer.

Update of the weights connecting the hidden layer and the input layer

Unlike a neuron in the output layer, a neuron in the hidden layer has no specified desired response. Thus the output error has to be back-propagated so that the weights connecting the input layer to the hidden layer can be updated.

As usual, the update of the weights using the gradient descent algorithm reads

W_{i_1 i_0}(t+1) = W_{i_1 i_0}(t) − η ∂E(t)/∂W_{i_1 i_0}(t).   (1.35)

The derivative term ∂E/∂W_{i_1 i_0} is computed as

∂E(t)/∂W_{i_1 i_0}(t) = ∑_{i_2=1}^{n_2} (∂E_{i_2}(t)/∂y_{i_2}) × (∂y_{i_2}/∂W_{i_1 i_0}(t)),   (1.36)

where the first term is computed using equation (1.29) and the second term is computed as follows:

∂y_{i_2}/∂W_{i_1 i_0} = (∂y_{i_2}/∂v_{i_1}) × (∂v_{i_1}/∂W_{i_1 i_0}).   (1.37)

In this equation the first term is computed using equations (1.23) and (1.24):

∂y_{i_2}/∂v_{i_1} = (∂y_{i_2}/∂h_{i_2}) × (∂h_{i_2}/∂v_{i_1}) = y_{i_2}(1 − y_{i_2}) W_{i_2 i_1},   (1.38)

while the second term is computed using equations (1.21) and (1.22):

∂v_{i_1}/∂W_{i_1 i_0} = (∂v_{i_1}/∂h_{i_1}) × (∂h_{i_1}/∂W_{i_1 i_0}) = v_{i_1}(1 − v_{i_1}) x_{i_0}.   (1.39)

Thus the final computable expression for the gradient term (1.36) is obtained using equations (1.29), (1.38) and (1.39):

∂E/∂W_{i_1 i_0} = − ∑_{i_2=1}^{n_2} (y_{i_2}^d − y_{i_2}) y_{i_2}(1 − y_{i_2}) W_{i_2 i_1} v_{i_1}(1 − v_{i_1}) x_{i_0}
             = − v_{i_1}(1 − v_{i_1}) x_{i_0} ∑_{i_2=1}^{n_2} δ_{i_2} W_{i_2 i_1}.

Thus the update law (1.35) has the final computable form:

W_{i_1 i_0}(t+1) = W_{i_1 i_0}(t) + η δ_{i_1} x_{i_0},   (1.40)

where δ_{i_1} = v_{i_1}(1 − v_{i_1}) ∑_{i_2=1}^{n_2} δ_{i_2} W_{i_2 i_1}.

The above recursive formula is the key to back-propagation learning. It allows the error signal of a lower layer to be computed as a linear combination of the error signals of the upper layer. In this manner, the error signals are back-propagated through all the layers from the top down. This implies that the influence from an upper layer on a lower layer, and vice versa, can only be effected via the error signals of the intermediate layer.
The BP algorithm is popular because each weight can be updated simultaneously, as the factors involved are all local.
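The complete instantaneous algorithm, i.e. the forward phase (1.21)-(1.24) followed by the updates (1.34) and (1.40), fits in a few lines of NumPy. A minimal sketch with an arbitrary toy pattern (network sizes and values are illustrative only; no bias terms, matching the derivation above):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

m, n1, n2, eta = 2, 8, 1, 0.5
W10 = rng.normal(scale=0.5, size=(n1, m))   # hidden weights W_{i1 i0}
W21 = rng.normal(scale=0.5, size=(n2, n1))  # output weights W_{i2 i1}

x  = np.array([0.3, 0.7])   # toy training pattern
yd = np.array([0.9])        # desired output

for _ in range(1000):
    v1 = sigmoid(W10 @ x)                   # forward phase, eqs. (1.21)-(1.22)
    y  = sigmoid(W21 @ v1)                  # eqs. (1.23)-(1.24)
    d2 = y * (1 - y) * (yd - y)             # delta of the output layer
    d1 = v1 * (1 - v1) * (W21.T @ d2)       # back-propagated delta
    W21 += eta * np.outer(d2, v1)           # update (1.34)
    W10 += eta * np.outer(d1, x)            # update (1.40)

print(y)   # moves toward the target 0.9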

1.3.3 Generalized Delta Rule

The delta rule discussed in the previous section can now be generalized to a generic multi-layered network with L layers as shown in Figure 1.5.
Let W_{i_k i_{k-1}} denote the synaptic weight connecting the i_k-th neuron of layer k to the i_{k-1}-th neuron of layer k − 1. Considering the sigmoid function as the activation function for each layer, the weight update law can be written as

W_{i_k i_{k-1}}(t+1) = W_{i_k i_{k-1}}(t) + η δ_{i_k} v_{i_{k-1}},   (1.41)

where v_{i_{k-1}} is the output of the i_{k-1}-th neuron of layer k − 1. The term δ_{i_k} for each layer is expressed as

δ_{i_L} = (y_{i_L}^d − y_{i_L}) y_{i_L}(1 − y_{i_L})   for the output layer L, and   (1.42)

δ_{i_k} = v_{i_k}(1 − v_{i_k}) ∑_{i_{k+1}=1}^{n_{k+1}} δ_{i_{k+1}} W_{i_{k+1} i_k}   for the hidden layers.   (1.43)

All the weights can be updated in a similar manner. Thus, for example, if one considers a four-layered network, the weight update rule for the 4th layer, i.e. the output layer, will be

W_{i_4 i_3}(t+1) = W_{i_4 i_3}(t) + η δ_{i_4} v_{i_3},

where

δ_{i_4} = (y_{i_4}^d − y_{i_4}) y_{i_4}(1 − y_{i_4}).

y_{i_4} is the response of the i_4-th unit of the output layer and y_{i_4}^d is the corresponding desired output. Similarly, the weight update law for the 3rd layer is derived as

W_{i_3 i_2}(t+1) = W_{i_3 i_2}(t) + η δ_{i_3} v_{i_2},

where δ_{i_3} = v_{i_3}(1 − v_{i_3}) ∑_{i_4=1}^{n_4} δ_{i_4} W_{i_4 i_3}. The number of neurons in the i-th layer is denoted n_i. The weight update law for the 2nd layer is

W_{i_2 i_1}(t+1) = W_{i_2 i_1}(t) + η δ_{i_2} v_{i_1},

where δ_{i_2} = v_{i_2}(1 − v_{i_2}) ∑_{i_3=1}^{n_3} δ_{i_3} W_{i_3 i_2}, and for the 1st layer

W_{i_1 i_0}(t+1) = W_{i_1 i_0}(t) + η δ_{i_1} x_{i_0},

where δ_{i_1} = v_{i_1}(1 − v_{i_1}) ∑_{i_2=1}^{n_2} δ_{i_2} W_{i_2 i_1} and x_{i_0} represents a typical signal in the input layer.


The recursive formula is the key to back-propagation learning. It allows the error signal of a lower layer, δ_{i_{k-1}}, to be computed as a linear combination of the error signals of the upper layer, δ_{i_k}. In this manner the error signals are back-propagated through all the layers from the top down. This implies that the influences from an upper layer on a lower layer, and vice versa, can only be effected via the error signals of the intermediate layer.

Example 1.3. Let us consider the following discrete-time dynamical system:

y(k+1) = y(k)/(1 + y²(k)) + u³(k).   (1.44)

Identify the system using a multi-layered network.
Solution
Since a feed-forward network is a static network, when a dynamical system is identified using this kind of network, the previous state is also fed to the network as an input along with the original system input. The system being a stable one, the input is generated randomly between ±1, and 1000 data pairs are generated with this input range using equation (1.44). An MLN is considered with one input layer, one hidden layer and one output layer. The number of neurons in the hidden layer is taken as 15, and the sigmoid function is taken as the activation function. An instantaneous update using the back-propagation algorithm is employed to train the network. The training results are shown in Figure 1.8, where the left figure shows the input-output relationship of the training data as well as the network output after training, and the right figure shows the desired and actual output of the network at every sampling instant. The convergence of the mean square error over 1500 epochs is shown in Figure 1.9 (left). Once the training is over, 1000 new data points are generated to validate the identified model while taking the input to the system as u(k) = sin(0.1k). The result of model validation is shown in Figure 1.9 (right). The RMS error for the training data set is found to be 0.022, while for the testing data set it is 0.034. The MATLAB code for the above example is given as follows.
Figure 1.8: Example 1.3: System identification results after training.

Figure 1.9: Example 1.3: Left: epoch-wise convergence of the mean square error; right: model validation using the test data.

mlpsiso.m (MATLAB code)

clear
d(1)=0.5; p=zeros(2,1000);
for k=1:1000
    u(k) = 2*rand - 1;
    d(k+1) = d(k)/(1+d(k)*d(k)) + u(k)*u(k)*u(k);
    p(1,k)=u(k); p(2,k)=d(k); t(k)=d(k+1);
end
pr = [-1 1; -1.5 1.5];
net = newff(pr,[50 1],{'radbas','purelin'},'traingdx');
y = sim(net,p);
plot(u,t,'*',u,y,'o')
net.trainParam.epochs = 10000;
net.trainParam.goal = 0.0001;
net.trainParam.lr = 0.3;
net = train(net,p,t); y = sim(net,p);
figure; plot(u,t,'*',u,y,'o')
d(1)=0.5; p=zeros(2,1000);
for k=1:1000
    u(k) = sin(0.1*k);
    d(k+1) = d(k)/(1+d(k)*d(k)) + u(k)*u(k)*u(k);
    p(1,k)=u(k); p(2,k)=d(k); t(k)=d(k+1);
end
y = sim(net,p);
figure; plot(t); hold on; plot(y,'r')

1.3.4 Convergence of BP Learning Algorithm

The weight update law in back propagation is derived using the gradient descent algorithm, where a cost function E has to be minimized with respect to the weight vector w. The gradient descent algorithm ensures that the first-order approximation of E(w) reaches its minimum, but it does not ensure global convergence for the original cost function E(w). The weight update law is given as

w(t+1) = w(t) − η ∇E(t),

where η > 0 is the learning rate and ∇E is the gradient of the cost function E. The learning rate η determines the speed of convergence, and convergence depends on a proper choice of initial conditions. Let us consider minimization of the following function: E(w) = 0.5w² − 8 sin w + 7. It is a multi-modal function with two local minima and a global minimum.
The gradient descent update law for the parameter w is obtained as

∆w = −η ∂E/∂w = −η (w − 8 cos w).
Figure 1.10 shows the convergence of the cost function with respect to the weight w for η = 0.01 and η = 0.4, both starting from w0 = −8.

Figure 1.10: Convergence of the gradient descent algorithm.

It can be seen from the figure that when η = 0.01 and w0 = −8, the final weight w = −4 corresponds to a local minimum. When η = 0.4


and w0 = −8, the local minimum can be avoided and the final weight ultimately settles down at the global minimum. One can conclude that for a small step size, BP gets stuck at a local minimum. For a larger step size, it may come out of local minima and get into the global minimum, but the algorithm may not converge for a larger step size. Moreover, when the step size is large, the iterate zigzags its way about the true direction to the global minimum, leading to slow convergence. There is no comprehensive method to select the initial weight w0 and learning rate η for a given problem so that the optimal weight vector can be reached.
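The behaviour described above can be reproduced in a few lines of Python; a minimal sketch (the bookkeeping of the best point visited is an added illustration):

# Gradient descent on the multi-modal cost E(w) = 0.5*w^2 - 8*sin(w) + 7
import numpy as np

E  = lambda w: 0.5 * w**2 - 8.0 * np.sin(w) + 7.0
dE = lambda w: w - 8.0 * np.cos(w)

for eta in (0.01, 0.4):
    w, best = -8.0, -8.0
    for _ in range(500):
        w = w - eta * dE(w)
        if E(w) < E(best):
            best = w
    print(f"eta={eta}: final w={w:.3f}, best w visited={best:.3f}")
# With eta = 0.01 the iterates settle at the local minimum near w = -4;
# with eta = 0.4 they escape that basin and visit the neighbourhood of the
# global minimum near w = 1.4, though the large fixed step keeps zigzagging.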

1.3.5 Variations in Back Propagation Algorithm

As discussed earlier, the back propagation algorithm is extremely slow to converge. Several variations have been suggested to improve the speed and also to account for other factors such as generalization ability and avoidance of local minima. Since the BPA is a gradient descent algorithm, all techniques available for improving upon the gradient descent procedure apply to the BPA. These include making the learning rate η and the coefficient of the momentum term either gradually decay or adaptive. Also, any second-derivative-based nonlinear optimization method, such as the Newton-Raphson method or the conjugate gradient method, may be adopted for the training of feed-forward networks. However, these variants also share problems present in the standard BPA and may converge faster in some cases and slower in others.

Adding a momentum term

A problem arises from error surfaces with steep sides and shallow valley floors. One simple but effective way to deal with this problem is the addition of a momentum term. This allows one to safely increase the learning rate and to cross long flat regions faster.
Adding a momentum term simply means that one is giving a momentum to the learning, i.e. the learning is made faster. If α is the momentum rate, then adding momentum leads to the following change in the weight update law:

w(t+1) = w(t) − η ∂E/∂w(t) + α [w(t) − w(t−1)].

Writing ∆w = w(t+1) − w(t) and assuming the weight change settles to a steady value so that w(t) − w(t−1) ≈ ∆w, this gives

∆w = −η ∂E/∂w(t) + α ∆w
(1 − α) ∆w = −η ∂E/∂w(t)
∆w = −(η/(1 − α)) ∂E/∂w(t).   (1.45)

From the above equation one can see that the learning rate is effectively increased by a factor of 1/(1 − α). Thus the plateau region can be crossed at a much faster rate.
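A minimal sketch of the momentum-augmented update on a simple quadratic cost (an illustrative function, not from the text), showing the effective speed-up:

# Effect of the momentum term on plain gradient descent for E(w) = 0.5*w^2
dE = lambda w: w
eta, alpha = 0.01, 0.9

w_plain, w_mom, dw = 5.0, 5.0, 0.0
for t in range(200):
    w_plain -= eta * dE(w_plain)            # plain GD
    dw = -eta * dE(w_mom) + alpha * dw      # momentum step, cf. eq. (1.45)
    w_mom += dw

print(w_plain, w_mom)  # w_mom is far closer to 0: effective rate eta/(1 - alpha)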

1.4 Radial Basis Function Networks


The popularity of the radial basis function network (RBFN) is mainly due to its simple architecture with a single hidden layer, which is especially good in applications requiring locally tunable properties [haykin:99]. As the name implies, this network makes use of radial functions to represent an input in terms of radial centers. Moreover, this network is very attractive for intelligent control applications since its response is linear in terms of the weights. Thus weight update rules for such networks within the intelligent control paradigm become easy to derive through Lyapunov stability analysis, as the readers will notice in Chapters 4 and 5. The network architecture of a radial basis function network (RBFN) is shown in Figure 1.11. The RBFN consists of three layers:

• An input layer

• A hidden layer

• An output layer

The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns.

1. Hidden units are known as radial centers. Each radial center is represented by a vector ci, i = 1, . . . , h, where h is the number of radial centers in the hidden layer.

2. The transformation from the input space to the hidden unit space is nonlinear, whereas the transformation from the hidden unit space to the output space is linear.

3. The dimension of each center for a p-input network is p × 1.

1.4.1 Radial Basis Functions

The radial basis function in the hidden layer produces a significant nonzero response only when the input falls within a small localized region of the input space. Each hidden unit, known as a radial center, has its own 'receptive field' in the input space, i.e. each center is representative of one or some of the input patterns. This is called 'local representation of inputs', and the network is also known as a 'localized receptive field network'.

Figure 1.11: Radial basis function network.

Consider, for instance, an input vector x which lies in the receptive field of center ci. This would activate the hidden center ci and, by proper choice of
weights, the target output is obtained. Suppose an input vector lies between two receptive field centers; then both those hidden units will be appreciably activated, and the output will be a weighted average of the corresponding targets. In short, the inputs are clustered around the centers and the output is linear in terms of the weights wi:

y = ∑_{i=1}^{h} ϕi wi.   (1.46)

The response of the i-th radial center of the RBFN is usually expressed as

ϕi = ϕ(∥x − ci∥),   (1.47)
where ∥x − ci∥ is the Euclidean distance between x and ci, and ϕ(·) is a radial basis function. When this basis function is Gaussian, each node produces an identical output for inputs within a fixed radial distance from the center, i.e. they are radially symmetric. This accounts for the nomenclature 'radial basis function'. Common activation functions include

ϕ(z) = e^{−z²/2σ²}         - Gaussian radial function
ϕ(z) = z² log z            - thin plate spline
ϕ(z) = (z² + r²)^{1/2}     - quadratic
ϕ(z) = 1/(z² + r²)^{1/2}   - inverse quadratic
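A minimal sketch of the RBFN forward response (1.46)-(1.47) with Gaussian basis functions; the centers, weights and width below are illustrative values:

import numpy as np

def rbfn_response(x, centers, w, sigma):
    # z_i = ||x - c_i||, phi_i = exp(-z_i^2 / (2 sigma^2))
    z = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-z**2 / (2.0 * sigma**2))
    return phi @ w                          # y = sum_i phi_i w_i, eq. (1.46)

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # h = 2 radial centers
w = np.array([0.5, -0.3])
print(rbfn_response(np.array([0.2, 0.1]), centers, w, sigma=1.0))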

1.4.2 Learning in RBFN

Training of an RBF network requires optimal selection of the centers ci and the weights wi, i = 1 to h. This is a two-fold problem, unlike in a multi-layered network. As the different layers of the network perform different tasks, the two layers are optimized using different techniques and on different time scales. Several strategies are applicable depending on the way in which the radial centers are specified. Some of these strategies are discussed below.

Pseudo-Inverse Technique

The pseudo-inverse approach computes the weight vector w in batch mode, i.e. the weight update is done after the network computes the forward responses of all input patterns. Assume that the radial functions for the hidden units are Gaussian. The centers are chosen randomly from the training data set (provided the training set is representative). The Gaussian function is normalized, i.e. for any x, ∑_i ϕi = 1. The width of the radial basis function is usually determined by an ad hoc choice:

σ = d/√(2h),   (1.48)

where d is the spread of the centers, i.e. the maximum distance between two centers, and h is the number of centers.
Such a choice of σ helps to avoid extremes (functions that are too peaked or too flat). Moreover, the performance of the network is not very sensitive to the precise value of σ.
Once the width of the basis function is fixed, the next task is to learn the weights. For a given input-output pattern (x_p, y_p), the output in terms of the weight vector w can be written as

y_p = ϕ_p⊤ w,   (1.49)
Figure 1.12: RBF network for solving the problem.

where the i-th element of ϕ_p is computed as ϕ_i^p = e^{−h∥x_p − c_i∥²/d²}. Collecting such expressions over all patterns leads to the following least squares problem:

Φw = y,   (1.50)

where the p-th row of Φ is ϕ_p⊤ and y = [y_1, . . . , y_p, . . .]⊤. The weight vector w can be solved using the following pseudo-inverse:

w = (Φ⊤Φ)^{-1} Φ⊤y.   (1.51)

Example 1.4. Consider the RBF network shown in Figure 1.12 to solve an EX-NOR problem:

x1  x2  y
0   0   1
0   1   0
1   0   0
1   1   1

The output of the RBF network is given by

y = w1 ϕ1 + w2 ϕ2 + θ.

Assuming Gaussian radial centers with ϕ(z) = e^{−z²}, find the weights w1 and w2 associated with the two hidden units of the network, and θ, the weight associated with the output bias.
Solution
Applying the four training patterns one after another, the following equations are obtained:

w1 + w2 e^{−2} + θ = 1
w1 e^{−1} + w2 e^{−1} + θ = 0
w1 e^{−1} + w2 e^{−1} + θ = 0
w1 e^{−2} + w2 + θ = 1

Thus there are 4 equations to be solved for the 3 unknowns w1, w2, θ. In this case,

Φ = \begin{bmatrix} 1 & 0.1353 & 1 \\ 0.3679 & 0.3679 & 1 \\ 0.3679 & 0.3679 & 1 \\ 0.1353 & 1 & 1 \end{bmatrix},
w = \begin{bmatrix} w1 \\ w2 \\ θ \end{bmatrix},
y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}.

Using equation (1.51), one gets the desired weights as

w = \begin{bmatrix} 2.5031 \\ 2.5031 \\ −1.848 \end{bmatrix}.
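The result can be verified numerically with the pseudo-inverse (1.51). The sketch below assumes the two Gaussian centers sit at (0,0) and (1,1); this is consistent with the equations above but not stated explicitly in the example:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 1.0])              # EX-NOR targets
c = np.array([[0.0, 0.0], [1.0, 1.0]])          # assumed centers

# phi(z) = exp(-z^2) with z the Euclidean distance to each center
phi = np.exp(-((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2))
Phi = np.hstack([phi, np.ones((4, 1))])         # last column: bias weight theta
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # eq. (1.51)
print(w)   # approximately [2.50, 2.50, -1.84]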

Gradient Descent Algorithm

One of the most natural approaches to update the centers and weights is supervised training by error-correction learning, which is easily implemented using a gradient descent procedure. The update rule for center learning is

c_ij(t+1) = c_ij(t) − η1 ∂E(t)/∂c_ij(t)

for i = 1 to h and j = 1 to p. Similarly, the update for the weights is given as

w_i(t+1) = w_i(t) − η2 ∂E(t)/∂w_i(t),

where the cost function is E = (1/2) ∑ (y^d − y)². The actual response of the RBF network shown in Figure 1.11 is computed as

y = ∑_{i=1}^{h} ϕi wi.

Let the radial basis function be taken as Gaussian:

ϕi = e^{−z_i²/2σ²},

where z_i = ∥x − c_i∥ and σ is the spread or width of the Gaussian function.

Differentiating E with respect to w_i yields:

∂E/∂w_i = (∂E/∂y) × (∂y/∂w_i) = −(y^d − y) ϕi.   (1.52)

Differentiating E with respect to c_ij yields:

∂E/∂c_ij = (∂E/∂y) × (∂y/∂ϕi) × (∂ϕi/∂c_ij)   (1.53)
        = −(y^d − y) × w_i × (∂ϕi/∂z_i) × (∂z_i/∂c_ij),   (1.54)

with

∂ϕi/∂z_i = −(z_i/σ²) ϕi   (1.55)

and

∂z_i/∂c_ij = ∂/∂c_ij ( ∑_j (x_j − c_ij)² )^{1/2}   (1.56)
          = −(x_j − c_ij)/z_i.   (1.57)

After simplification, the update rule for the centers is

c_ij(t+1) = c_ij(t) + η1 (y^d − y) w_i (x_j − c_ij) ϕi/σ².   (1.58)

The update rule for the weights is

w_i(t+1) = w_i(t) + η2 (y^d − y) ϕi.   (1.59)

The gradient descent vector ∂E/∂c_ij exhibits a clustering effect. Note that for an RBF network there is no back-propagation of error, unlike the supervised learning in a multi-layered network.
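A minimal sketch of one instantaneous training pass with the updates (1.58)-(1.59); the target function, network size and learning rates are illustrative choices only:

import numpy as np

rng = np.random.default_rng(1)
h, p, sigma, eta1, eta2 = 20, 2, 0.7, 0.05, 0.1
C = rng.uniform(-1, 1, size=(h, p))   # centers c_i
w = np.zeros(h)                       # weights w_i

def forward(x):
    z = np.linalg.norm(C - x, axis=1)
    phi = np.exp(-z**2 / (2 * sigma**2))
    return phi, phi @ w

for _ in range(5000):
    x = rng.uniform(-1, 1, size=p)
    yd = x[0] * x[1]                  # toy target (illustrative)
    phi, y = forward(x)
    err = yd - y
    w += eta2 * err * phi                                        # eq. (1.59)
    C += eta1 * err * (w * phi)[:, None] * (x - C) / sigma**2    # eq. (1.58)

phi, y = forward(np.array([0.5, -0.5]))
print(y)   # response near the target -0.25; quality depends on h, sigma, etas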

Example 1.5. Consider the dynamical system described by equation (1.44). Identify the system using a radial basis function network.
Solution
As mentioned in Example 1.3, since the system is dynamic while the RBFN is a static network, the previous state is also fed to the network as an input along with the original system input during the identification process. The input is generated randomly between ±1, and 1000 data pairs are generated with this input range using equation (1.44). The number of RBFN centers is taken as 100. The gradient descent algorithm is employed to update the weights as well as the centers of the network. The training results are shown in Figure 1.13, where Figure 1.13(a) shows the input-output relationship of the training data as well as the network output after training, and Figure 1.13(b) shows the desired and actual output of the network at every sampling instant.
The convergence of the mean square error over 1500 epochs is shown in Figure 1.14. Once the training is over, 1000 new data points are generated to validate the identified model while taking the input to the system as u(k) = sin(0.1k). The result of model validation is shown in Figure 1.15. The RMS error for the training data set is found to be 0.026, while for the testing data set it is 0.04.

Figure 1.13: System identification results after training: (a) input-output relationship; (b) desired and actual outputs at each sampling instant.

Figure 1.14: Example 1.5: Epoch-wise convergence of the mean square error.

The MATLAB code is given as follows.

Figure 1.15: Example 1.5: Model validation using the test data.

rbfsiso.m (MATLAB code)

clear
d(1)=0.5; p=zeros(2,1000);
for k=1:1000
    u(k) = 2*rand - 1;
    d(k+1) = d(k)/(1+d(k)*d(k)) + u(k)*u(k)*u(k);
    p(1,k)=u(k); p(2,k)=d(k); t(k)=d(k+1);
end
pr = [-1 1; -1.5 1.5];
goal = 0.000001;
net = newrb(p,t,goal);
y = sim(net,p);
size(y)
plot(u,t,'*',u,y,'o')

d(1)=0.5; p=zeros(2,1000);
for k=1:1000
    u(k) = sin(0.1*k);
    d(k+1) = d(k)/(1+d(k)*d(k)) + u(k)*u(k)*u(k);
    p(1,k)=u(k); p(2,k)=d(k); t(k)=d(k+1);
end
y = sim(net,p);
figure; plot(t); hold on; plot(y,'r')

Hybrid Learning

In hybrid learning, the radial basis functions relocate their centers in a self-organized manner while the weights of the output layer are computed using a supervised learning rule; hence the name 'hybrid learning'. Due to the self-organized learning, the radial centers are placed in those regions where significant input data are present. When a pattern is presented to the RBF network during training, either a new center is grown, if the pattern is sufficiently novel, or the network parameters in both layers are updated using gradient descent. The test for novelty relies on two criteria:

1. Is the Euclidean distance between the input pattern and the nearest existing center greater than a threshold δ(t)?

2. Is the mean square error at the output greater than a desired accuracy ε?

A new center is allocated when both of the novelty criteria are satisfied.
The K-means clustering technique is employed for the self-organized selection of centers. It dynamically updates the cluster centers at each iteration until they reach a stable state. Each center moves to the densest part of its cluster, and hence the intra-cluster distance is minimized. The input vector x is classified according to the most frequently occurring label among those of the K nearest samples. The procedure of the center update is as follows.
Find the center that is closest to x in terms of the Euclidean distance. This particular center is updated according to the following rule:

c_i(t+1) = c_i(t) + α (x − c_i(t)).

Thus the center moves closer to x. If P(x) is the probability density of x, this procedure minimizes ∫ ∥x − c_i∥² P(x) dx. The rate α ∈ (0, 1) is gradually decreased to 0.
While the centers are updated using the unsupervised scheme, the weights can be updated using any of the least squares algorithms, such as gradient descent (GD) or recursive least squares (RLS). The RLS [good:91] algorithm is presented below.
The i-th output of the RBFN described earlier can be written as

x_i = ϕ⊤θ_i,   i = 1, . . . , n,

where ϕ ∈ R^l is the output vector of the hidden layer and θ_i ∈ R^l is the connection weight vector from the hidden units to the i-th output unit. The weight update equations as per the RLS algorithm are

θ_i(k+1) = θ_i(k) + P(k+1) ϕ(k) [x_i(k+1) − ϕ(k)⊤θ_i(k)],   (1.60a)

P(k+1) = P(k) − P(k) ϕ(k) [1 + ϕ(k)⊤P(k)ϕ(k)]^{-1} ϕ(k)⊤P(k),   (1.60b)
where P(k) ∈ R^{l×l} is known as the covariance matrix. RLS is more accurate and faster compared to the least mean square (LMS) [simon:aft] algorithm.
The simplest approach is to update the centers using the gradient descent algorithm, while the weights are updated using the simple LMS algorithm. Although the computational requirement increases when adjusting centers, the number of centers can be substantially reduced by this approach. The generalization performance of such a network is much better compared to a hybrid learning scheme where the centers are fixed or learned unsupervised and the weights are updated using the recursive least squares algorithm.
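A minimal sketch of the RLS update (1.60a)-(1.60b) for one output unit; the hidden-layer dimension and the sample call at the end are illustrative:

import numpy as np

l = 4                       # number of hidden units (illustrative)
theta = np.zeros(l)         # weight vector theta_i
P = 100.0 * np.eye(l)       # initial covariance matrix

def rls_step(theta, P, phi, target):
    # eq. (1.60b): covariance update
    Pphi = P @ phi
    P = P - np.outer(Pphi, Pphi) / (1.0 + phi @ Pphi)
    # eq. (1.60a): weight update using the new covariance
    theta = theta + P @ phi * (target - phi @ theta)
    return theta, P

rng = np.random.default_rng(0)
theta, P = rls_step(theta, P, rng.random(l), 0.7)   # one update, target 0.7
print(theta)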

Example 1.6. Consider the dynamical system described by equation (1.44). Identify the system using a hybrid learning scheme. Show the initial and final distributions of the RBFN centers.
Solution
For the dynamical system (1.44), the input is generated randomly between ±1, and 1000 data pairs are generated with this input range using equation (1.44). The number of RBFN centers is taken as 100. K-SOM clustering is employed to update the centers, and the RLS algorithm is used to update the weights of the network. The initial covariance matrix P for RLS is taken as a diagonal matrix with diagonal elements 100. The mapping of input data and centers is shown in Figure 1.16. The left figure shows how the centers are initialized, and the right figure shows the spreading of the centers towards the nearest inputs after learning. Identification results are found to be satisfactory, with an
Figure 1.16: Initial (left) and final (right) distribution of the RBF centers over the input data (u(k), y(k)) after training.

RMS error of 0.018 for training and 0.027 for testing data.

1.5 Adaptive Learning Rate

Faster convergence and function approximation accuracy are two key issues in choosing a training algorithm. The popular method for training a feed-forward network has been back propagation. One of the main drawbacks of the back propagation algorithm is its slow rate of convergence and its inability to ensure global convergence. Some heuristic methods, like adding a momentum term to the original BP algorithm, and standard numerical optimization techniques using quasi-Newton methods have been proposed to improve the convergence rate of the BP algorithm [haykin:99, sarkar:95]. The problem with quasi-Newton methods is that the storage and memory requirements go up as the square of the size of the network. Nonlinear optimization techniques such as the Newton method and the conjugate gradient method [Chara:92, osowski:96] have also been used for training. Though these algorithms converge in fewer iterations than the BP algorithm, they require too much computation per pattern. Other algorithms for faster convergence include extended Kalman filtering (EKF) [Iguni:92], recursive least squares (RLS) [Bilski:98] and Levenberg-Marquardt (LM) [Hagan:94, Lera:02]. In order to overcome the computational complexity of these algorithms, a number of improvements have also been suggested [Okyay:01, Toledo:05]. However, these improvements do not bring them closer to the back-propagation algorithm as far as simplicity and ease of implementation are concerned.
In a recent work [swagat:06], two novel algorithms have been proposed for an adaptive learning rate using Lyapunov stability theory. The proposed algorithm has an exact parallel with the popular BP algorithm, where the fixed learning rate in the BP algorithm is replaced by an adaptive learning rate. It is observed that this adaptive learning rate increases the speed of convergence. Although Yu et al. [Yu:02] and Yu et al. [poznyak:01] have used Lyapunov-function-based weight update algorithms, none of them addresses the issue of computing an adaptive learning rate in a formal manner. The nature of convergence has also not been discussed in their work.
It is shown in Section 1.3.5 that a momentum term may be added to the BP algorithm in order to speed up its convergence rate. Such a term arises naturally in Lyapunov-function-based algorithms, and thus they provide a theoretical justification for such kinds of modifications.

1.5.1 Lyapunov Function based Adaptive Learning Rate

Consider a simple feed-forward neural network with a single output, where ϕ(·) is the nonlinear activation function of the neurons. The network is parameterized in terms of its weights, which are represented as a weight vector w ∈ R^m. For a specific function approximation problem, the training data consist of N patterns, {x_p, y_p^d}, p = 1, 2, ..., N. For a specific pattern p, if the input vector is x_p, then the network output is given by

y_p = f(w, x_p),   p = 1, 2, . . . , N.   (1.61)

The usual quadratic cost function which is minimized to train the weight vector w is given by

E = (1/2) ∑_{p=1}^{N} (y_p^d − y_p)².   (1.62)

In order to derive a weight update algorithm for such a network, we consider a Lyapunov function candidate

V1 = (1/2) ỹ⊤ỹ,   (1.63)

where ỹ = [y_1^d − y_1, . . . , y_p^d − y_p, . . . , y_N^d − y_N]⊤. As can be seen, in this case the Lyapunov function is the same as the usual quadratic cost function minimized during batch update using the back-propagation learning algorithm. The time derivative of the Lyapunov function V1 is given by

V̇1 = −ỹ⊤ (∂y/∂w) ẇ = −ỹ⊤ J ẇ,   (1.64)

where

J = ∂y/∂w ∈ R^{N×m}.   (1.65)
Theorem 1.1. If an arbitrary initial weight w(0) is updated by

w(t′) = w(0) + ∫₀^{t′} ẇ dt,   (1.66)

where

ẇ = (∥ỹ∥² / ∥J⊤ỹ∥²) J⊤ỹ,   (1.67)

then ỹ converges to zero under the condition that ẇ exists along the convergence trajectory.

Proof. Substituting Eq. (1.67) into Eq. (1.64), one gets

V̇1 = −∥ỹ∥² ≤ 0,   (1.68)

where V̇1 < 0 for all ỹ ≠ 0. If V̇1 is uniformly continuous and bounded, then according to Barbalat's lemma [krstic:02], as t → ∞, V̇1 → 0 and ỹ → 0.

The weight update law given in equation (1.67) is a batch update law. Analogous to the instantaneous gradient descent (or BP) algorithm, the instantaneous LF I learning algorithm can be derived as

ẇ = (∥ỹ∥² / ∥J_p⊤ỹ∥²) J_p⊤ỹ,   (1.69)

where ỹ = y_p^d − y_p ∈ R and J_p = ∂y_p/∂w ∈ R^{1×m} is the instantaneous value of the Jacobian. The difference equation representation of the weight update algorithm based on Eq. (1.69) is given by

w(t+1) = w(t) + µ ẇ(t)
      = w(t) + µ (∥ỹ∥² / ∥J_p⊤ỹ∥²) J_p⊤ỹ.   (1.70)

Here µ is a constant which is selected heuristically. A very small constant ε can be added to the denominator of Eq. (1.69) to avoid numerical instability when the error ỹ goes to zero. The Lyapunov function based algorithm can now be compared with the BP algorithm based on the gradient descent principle.
compared with the BP algorithm based on gradient descent principle.
In the instantaneous gradient descent method, the expression has the form:

∆w = −η (∂E/∂w)⊤ = η J_p⊤ỹ,   (1.71)

w(t+1) = w(t) + η J_p⊤ỹ,   (1.72)

where η is the learning rate. Comparing equation (1.72) with equation (1.70), one can see a very interesting similarity, where the fixed learning rate η in the BP algorithm is replaced by its adaptive version ηa given by

ηa = µ (∥ỹ∥² / ∥J_p⊤ỹ∥²).   (1.73)

This is the most remarkable finding of the work [swagat:06]. Earlier, there have been many research papers concerning the adaptive learning rate [haykin:99, Qiu:92, cheng:93, tol:90]. However, the computation of the adaptive learning rate using the Lyapunov function approach is a key contribution in this field.
Theorem 1.1 states that the global convergence of the learning algorithm (1.67) is guaranteed provided ẇ exists and is non-zero along the convergence trajectory. This, in turn, necessitates ∥∂V1/∂w∥ = ∥J⊤ỹ∥ ≠ 0. The condition ∥∂V1/∂w∥ = 0 represents local minima of the scalar function (1.63). Thus, Theorem 1.1 says that the global minimum is reached only when local minima are avoided during training. Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.
When the algorithm is implemented to learn an XOR map, Figure 1.17 shows how the learning rate naturally evolves and converges to zero as learning completes. Readers are referred to the work [swagat:06] for more details.

Figure 1.17: Adaptive learning rate while learning the XOR map.
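A minimal sketch of the LF I update (1.70) with the adaptive rate (1.73); the linear model used at the end is a hypothetical stand-in for any differentiable network:

import numpy as np

def lf1_step(w, x, yd, forward, jacobian, mu=0.5, eps=1e-8):
    # forward(w, x) -> scalar output; jacobian(w, x) -> dy/dw (1 x m)
    y = forward(w, x)
    e = yd - y                        # instantaneous error y~
    g = jacobian(w, x) * e            # Jp^T * y~ for a scalar output
    eta_a = mu * e**2 / (g @ g + eps) # adaptive learning rate, eq. (1.73)
    return w + eta_a * g              # update, eq. (1.70)

forward = lambda w, x: w @ x          # illustrative linear model
jacobian = lambda w, x: x
w = np.zeros(2)
for _ in range(50):
    w = lf1_step(w, np.array([1.0, 2.0]), 1.0, forward, jacobian)
print(w)   # converges so that w @ [1, 2] approaches the target 1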

1.6 Feedback Networks

The versatility of a neural network is considerably increased by the incorporation of feedback. Feed-forward networks, as discussed earlier in this chapter, learn the map from the input space to the output space. Once the weights are fixed, the state of each neuron is determined by the input vector, i.e. the neuron state is independent of the initial and past states of that neuron. Hence a feed-forward network is a static network; there are no dynamics involved. In contrast, a feedback network allows feedback connections, as shown in Figure 1.18.

Figure 1.18: (a) Partially connected recurrent network; (b) fully connected recurrent network.

Such a feedback network, often called a recurrent network, becomes a nonlinear dynamic system and exhibits highly nonlinear dynamic behavior, which makes it highly interesting. Such a system has very rich temporal and spatial behaviors, such as stable and unstable fixed points, limit cycles, and chaotic behaviors. These behaviors can be utilized to model certain cognitive functions, such as associative memory, unsupervised learning, self-organizing maps, and temporal reasoning. In this section, recurrent networks will be presented from a system identification perspective. Readers can refer to the following works [williams:89, williams:89b, pearlmutter:89, sato:90], based on which the subject matter of this section has been developed.

Consider a system expressed in discrete dynamics:

y(t) = f(y(t−1)) + g(y(t−1)) u(t−1),   (1.74)

where y(t) may be a scalar variable or a vector, and f(·) and g(·) are two arbitrary nonlinear functions of y(t−1). A feed-forward network will learn these dynamics as

ŷ(t) = f̂(y(t−1), u(t−1)),   (1.75)

where y(t−1) is the actual system output at instant t−1, whereas a recurrent network learns this map as

ŷ(t) = f̂(ŷ(t−1), u(t−1)),   (1.76)

which indicates that the network has internal memory, i.e. the present output state is a function of the previous output of the network. Figure 1.19 summarizes the difference between a feed-forward network and a feedback network.

Figure 1.19: (a) Feed-forward network (FFN): y(t−1) is obtained from the actual system, so the model is static. (b) Recurrent network (RN): the model behaves like a dynamic system, responding to the external input only.

1.6.1 Response of Recurrent Networks

Let us find the response y(t+1) of the network shown in Figure 1.18(b). The figure shows a fully connected network. The following convention will be followed to describe a recurrent structure. The neurons are serially numbered (including input neurons). The input to the i-th neuron is denoted as s_i. The response of the i-th neuron is denoted as x_i (this also includes input units). The connection weight from the j-th neuron to the i-th neuron is represented as w_ij. Using the above notation, the input to the i-th neuron can be computed as

s_i(t+1) = ∑_j w_ij x_j(t),

where t represents the temporal argument. The response of the above unit can be expressed as

x_i(t+1) = f(s_i(t+1)),

where f(·) represents the sigmoid activation function, i.e.

f(h) = 1/(1 + e^{−h}).

For the given network in Figure 1.18(b), the response of unit number 3 is

s_3(t+1) = w31 x1(t) + w32 x2(t) + w33 x3(t) + w34 x4(t) + w35 x5(t) = ∑_{i=1}^{5} w_{3i} x_i(t),
x_3(t+1) = f(s_3(t+1)).

Similarly, for the 4th unit, the response is

x_4(t+1) = f(s_4(t+1)),   s_4(t+1) = ∑_{i=1}^{5} w_{4i} x_i(t).

The response of the output unit (the 5th unit) can now be given as

y(t+1) = x_5(t+1) = f(s_5(t+1)).
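A minimal sketch of one forward step of the fully connected network of Figure 1.18(b); the weights and inputs below are illustrative:

import numpy as np

def f(h):
    return 1.0 / (1.0 + np.exp(-h))    # sigmoid activation

def step(W, x):
    # x holds the responses x_1..x_5 at time t; returns the responses at t+1.
    x_next = x.copy()
    for i in (2, 3, 4):                # units 3, 4, 5 (0-based indices)
        s = W[i] @ x                   # s_i(t+1) = sum_j w_ij x_j(t)
        x_next[i] = f(s)
    return x_next

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(5, 5))     # weights w_ij (illustrative)
x = np.array([0.3, 0.8, 0.0, 0.0, 0.0])    # x_1, x_2 are the external inputs
x = step(W, x)
print('y(t+1) =', x[4])                    # output unit x_5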

1.6.2 Learning Algorithms

Different learning algorithms are available for recurrent networks. Two popular algorithms among them are
1. Back Propagation Through Time
2. Real Time Recurrent Learning

1.6.3 Back Propagation Through Time (BPTT)

BPTT is an extension of the standard back-propagation algorithm. It is characterized by unfolding the temporal operation of the recurrent network into a multi-layered feed-forward network, with a new layer added at every time step. Consider the recurrent network shown in Figure 1.20(a). The output of the network is given as

y(t+1) = f(w x(t) + g y(t)),

Figure 1.20: (a) Original network with a single neuron and (b) unfolded network.

where f is the sigmoidal activation function, w is the connection weight between the input and the output neuron, and g is the self-feedback connection weight of the neuron onto itself. The unfolded network for t = 5 is shown in Figure 1.20(b). The unfolded network is a feed-forward network where the basic network repeats at every time step. For example, in the first time step, the output y(1) is a function of the previous state y(0) and the input x(0).

Example 1.7. Unfold the fully recurrent network given in Figure 1.18(b) in time.
Solution
The unfolded network until time step t = 4 is shown in Figure 1.21.

Figure 1.21: Unfolded version of the fully connected network.

39
Example 1.8. Enumerate steps for training the unfolded network of the
simple recurrent network given in Figure 1.20.
Solution
The steps to proceed further is described as follows:

1. Compute the response of the sequence y(1) to y(5), given the sequence
x(0) to x(4) and y(0).

2. Compute the error e(5) = y_d(5) − y(5) and the corresponding increments

   Δw_5 = η δ_5 x(4)
   Δg_5 = η δ_5 y(4),  where δ_5 = y(5)(1 − y(5)) e(5).

3. Compute e(4), e(3), e(2) and e(1), and proceed in the same way until the first node t = 1:

   Δw_4 = η δ_4 x(3)
   Δg_4 = η δ_4 y(3),  where δ_4 = y(4)(1 − y(4))[δ_5 g + e(4)].

   Similarly,

   Δw_3 = η δ_3 x(2),  Δg_3 = η δ_3 y(2),  δ_3 = y(3)(1 − y(3))[δ_4 g + e(3)]
   Δw_2 = η δ_2 x(1),  Δg_2 = η δ_2 y(1),  δ_2 = y(2)(1 − y(2))[δ_3 g + e(2)]
   Δw_1 = η δ_1 x(0),  Δg_1 = η δ_1 y(0),  δ_1 = y(1)(1 − y(1))[δ_2 g + e(1)]

The update rule is

w_new = w_old + ∑_{i=1}^{5} Δw_i
g_new = g_old + ∑_{i=1}^{5} Δg_i
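These steps translate directly into code. Below is a minimal Python sketch of one BPTT pass over the five-step unfolded network of Figure 1.20 (b), assuming a desired output y_d(t) is available at every step; the learning rate and the random data in the usage lines are purely illustrative:

```python
import numpy as np

def f(h):                                    # sigmoid activation
    return 1.0 / (1.0 + np.exp(-h))

def bptt_update(w, g, x_seq, y0, yd, eta=0.5):
    """One BPTT pass over the 5-step unfolded network.
    x_seq = [x(0)..x(4)], yd = [yd(1)..yd(5)]; returns updated (w, g)."""
    # forward pass: y(1)..y(5)
    y = [y0]
    for t in range(5):
        y.append(f(w * x_seq[t] + g * y[t]))
    # backward pass: delta_t = y(t)(1 - y(t))[delta_{t+1} g + e(t)]
    dw = dg = 0.0
    delta_next = 0.0
    for t in range(5, 0, -1):
        e_t = yd[t - 1] - y[t]
        delta = y[t] * (1 - y[t]) * (delta_next * g + e_t)
        dw += eta * delta * x_seq[t - 1]     # Δw_t = η δ_t x(t-1)
        dg += eta * delta * y[t - 1]         # Δg_t = η δ_t y(t-1)
        delta_next = delta
    return w + dw, g + dg

# usage with random data, purely illustrative
rng = np.random.default_rng(1)
w, g = 0.1, 0.1
x_seq, yd = rng.uniform(size=5), rng.uniform(size=5)
w, g = bptt_update(w, g, x_seq, 0.0, yd)
```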

System Identification using BPTT

Consider the following discrete time linear system

y(t + 1) = −0.5y(t) − y(t − 1) + 0.5u(t) (1.77)

Since the system dynamics is linear, a recurrent network, as shown in Fig-


ure 1.22, with linear activation function is taken to learn the dynamics. The
network response is given as

y(t + 1) = w1 y(t) + w2 y(t − 1) + w3 u(t) (1.78)

[Figure: a linear recurrent network with input u(t) weighted by w_3, and delayed outputs y(t) and y(t − 1) fed back through w_1 and w_2.]

Figure 1.22: A recurrent network that can learn the linear dynamics given
in (1.77).

[Figure: the network of Figure 1.22 unfolded in time; each slice maps y(k), y(k − 1) and u(k) through the shared weights w_1, w_2 and w_3 to y(k + 1).]

Figure 1.23: Unfolding of the network in Figure 1.22.

where w_1, w_2 and w_3 are unknown weights. A weight update algorithm using BPTT needs to be derived, using the same principle of back-propagation, so that these parameters converge to the actual parameters of the dynamics (1.77). The network in Figure 1.22 is unfolded in time as shown in Figure 1.23. The response of the network is computed at different time steps as follows:

y(2) = w_1 y(1) + w_2 y(0) + w_3 u(1)
y(3) = w_1 y(2) + w_2 y(1) + w_3 u(2)
...
y(n+1) = w_1 y(n) + w_2 y(n−1) + w_3 u(n)

The network is trained using the generalized delta rule as explained in the previous example. 100 data points have been generated from the dynamic model (1.77), where the input u(t) is drawn randomly between 0 and 1. The training data are shown in Figure 1.24; the output ranges between −2 and +2. The network is trained using BPTT.


Figure 1.24: Training data for learning the dynamics (1.77)

The RMS error versus the number of epochs is shown in the left panel of Figure 1.25; the final RMS error is 0.005. The weights are found to be

w_1 = −0.4999,  w_2 = −1.0002  and  w_3 = 0.4995.

The evolution of the weights during training is shown in the right panel of Figure 1.25.


Figure 1.25: System Identification using BPTT: Left figure shows the error
plot and right figure shows evolution of the weights

1.6.4 Real Time Recurrent Learning (RTRL)


In BPTT, the network is unrolled in time to construct a feed-forward network, and the generalized delta rule is applied to update the weights. BPTT is therefore an off-line technique. In contrast, in RTRL the gradient information at time t is forward-propagated to the next time step t + 1, so the algorithm can be implemented in real time.

RTRL: A simple example

Consider the recurrent network with a single neuron as shown in Figure


1.20 where x(t) is the input vector at time t and y(t + 1) is the output vector
at time t + 1. Forward response of the network is given as

y(t + 1) = f (s(t + 1))


where s(t+1) = w x(t) + g y(t) and f(h) = 1/(1 + e^{−h}).
The cost function to be minimized is defined as

E(t+1) = (1/2)(y_d(t+1) − y(t+1))².
The network has two parameters, w and g. Using the gradient descent algorithm one can write

w(t+1) = w(t) − η ∂E(t+1)/∂w
g(t+1) = g(t) − η ∂E(t+1)/∂g.
Differentiating E with respect to the synaptic weight w,

∂E(t+1)/∂w = −[y_d(t+1) − y(t+1)] ∂y(t+1)/∂w.

Let us define two variables:

P_w(t+1) = ∂y(t+1)/∂w, with P_w(0) = 0, and
P_g(t+1) = ∂y(t+1)/∂g, with P_g(0) = 0.

The objective is to derive a recursive relation for P_w(t+1) and P_g(t+1) in terms of P_w(t) and P_g(t). Compute ∂y(t+1)/∂w using the chain rule:

∂y(t+1)/∂w = [∂y(t+1)/∂s(t+1)] × [∂s(t+1)/∂w].

Since y(t+1) = 1/(1 + e^{−s(t+1)}), one can write

∂y(t+1)/∂s(t+1) = y(t+1)(1 − y(t+1)).

Similarly, ∂s(t+1)/∂w can be computed as

∂s(t+1)/∂w = ∂[w x(t) + g y(t)]/∂w = x(t) + g ∂y(t)/∂w = x(t) + g P_w(t).

Therefore,

∂y(t+1)/∂w = y(t+1)(1 − y(t+1))[x(t) + g P_w(t)].
In a similar fashion one can compute

∂y(t+1)/∂g = y(t+1)(1 − y(t+1))[y(t) + g P_g(t)]

which implies

∂E(t+1)/∂w = −[y_d(t+1) − y(t+1)] y(t+1)(1 − y(t+1))[x(t) + g P_w(t)]
∂E(t+1)/∂g = −[y_d(t+1) − y(t+1)] y(t+1)(1 − y(t+1))[y(t) + g P_g(t)].

The weight update laws thus can be written as

w(t+1) = w(t) + η[y_d(t+1) − y(t+1)] P_w(t+1)
g(t+1) = g(t) + η[y_d(t+1) − y(t+1)] P_g(t+1)

where P_w(t+1) and P_g(t+1) are computed using the following recursions:

P_w(t+1) = y(t+1)(1 − y(t+1))[x(t) + g P_w(t)]
P_g(t+1) = y(t+1)(1 − y(t+1))[y(t) + g P_g(t)]

with P_w(t) = ∂y(t)/∂w and P_g(t) = ∂y(t)/∂g.

Thus P_w(1) = y(1)(1 − y(1))[x(0) + g P_w(0)] with P_w(0) = 0, and so on; similarly for P_g(·).

[Figure: a fully connected recurrent network with input x_1(t) feeding neurons 2 and 3 through w_21 and w_31, recurrent weights w_22, w_24, w_33, w_34, w_42, w_43, w_44 among neurons 2, 3 and 4, and output y(t + 1) taken from neuron 4.]

Figure 1.26: A real-time recurrent network.

Example 1.9. Derive the RTRL algorithm for the network shown in Figure
1.26.
Solution
Forward response of the network is computed as follows:

s2 (t + 1) = w21 x1 (t) + w22 x2 (t) + w24 x4 (t)


s3 (t + 1) = w31 x1 (t) + w33 x3 (t) + w34 x4 (t)
s4 (t + 1) = w42 x2 (t) + w43 x3 (t) + w44 x4 (t)

xi (t + 1) = f (si (t + 1)), for i = 2, 3, 4


y(t+1) = x_4(t+1),  f(z) = 1/(1 + e^{−z}).
The weight update law is given by

w_jk(t+1) = w_jk(t) − η ∂E(t+1)/∂w_jk

where E(t+1) = (1/2)(y_d(t+1) − y(t+1))² and

∂E(t+1)/∂w_jk = −(y_d(t+1) − y(t+1)) ∂y(t+1)/∂w_jk.

Let us define P⁴_jk(t+1) = ∂y(t+1)/∂w_jk = ∂x_4(t+1)/∂w_jk. One can write

P⁴_jk(t+1) = f′(s_4(t+1))[w_42 P²_jk(t) + w_43 P³_jk(t) + w_44 P⁴_jk(t) + δ_j4 x_k(t)]

where δ_j4 = 1 if j = 4 (and zero otherwise), f′(s_4(t+1)) = y(t+1)(1 − y(t+1)), and f′ denotes the derivative of f with respect to its argument. Similarly one can write

P³_jk(t+1) = f′(s_3(t+1))[w_33 P³_jk(t) + w_34 P⁴_jk(t) + δ_j3 x_k(t)]
P²_jk(t+1) = f′(s_2(t+1))[w_22 P²_jk(t) + w_24 P⁴_jk(t) + δ_j2 x_k(t)].

The same recursions hold for an arbitrary weight w_pq: in P⁴_pq(t+1), P³_pq(t+1) and P²_pq(t+1), the Kronecker terms δ_p4 x_q(t), δ_p3 x_q(t) and δ_p2 x_q(t) are nonzero only when the weight w_pq feeds the corresponding neuron, i.e. when p = 4, 3 or 2 respectively.

For example, let us compute w_31(t+1):

w_31(t+1) = w_31(t) − η ∂E(t+1)/∂w_31
∂E(t+1)/∂w_31 = −[y_d(t+1) − y(t+1)] ∂y(t+1)/∂w_31 = −[y_d(t+1) − y(t+1)] P⁴_31(t+1).

Therefore

w_31(t+1) = w_31(t) + η[y_d(t+1) − y(t+1)] P⁴_31(t+1)

where

P⁴_31(t+1) = y(t+1)[1 − y(t+1)][w_42 P²_31(t) + w_43 P³_31(t) + w_44 P⁴_31(t)].

One proceeds similarly for P²_31 and P³_31, starting from the initial conditions Pⁱ_31(0) = 0 for i = 2, 3, 4.

System identification using RTRL

Consider the discrete time linear system as described by equation (1.77). A


recurrent network with linear activation function is chosen to identify the
system dynamics. The response of the network is given as

y(t + 1) = w1 y(t) + w2 y(t − 1) + w3 u(t)

where w_1, w_2 and w_3 are unknown weights. A weight update algorithm is to be found using RTRL so that these parameters converge to the actual system parameters of the dynamics (1.77). The cost function is taken as E(t+1) = (1/2)(y_d(t+1) − y(t+1))². The weight update law is given

as
w_i(t+1) = w_i(t) − η ∂E(t+1)/∂w_i,  for i = 1, 2, 3.
The partial derivatives of the cost function with respect to the weights are
computed as follows:

∂E(t+1)/∂w_1 = −(y_d(t+1) − y(t+1)) P_w1(t+1)

where

P_w1(t+1) = ∂y(t+1)/∂w_1 = y(t) + w_1 P_w1(t) + w_2 P_w1(t−1).

Similarly,

∂E(t+1)/∂w_2 = −(y_d(t+1) − y(t+1)) P_w2(t+1)
P_w2(t+1) = ∂y(t+1)/∂w_2 = y(t−1) + w_1 P_w2(t) + w_2 P_w2(t−1)
∂E(t+1)/∂w_3 = −(y_d(t+1) − y(t+1)) P_w3(t+1)
P_w3(t+1) = ∂y(t+1)/∂w_3 = u(t) + w_1 P_w3(t) + w_2 P_w3(t−1)
100 input-output data points, generated as for the BPTT example, have been used to train the network; the training data are shown in Figure 1.24. Using RTRL over the same time sequence, the final weights are found to be

w_1 = −0.5000,  w_2 = −0.9999  and  w_3 = 0.5010

with a final RMS error of 0.0009. Figure 1.27(a) shows the RMS error versus the number of epochs, while Figure 1.27(b) shows the evolution of the weights during training.
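A compact Python sketch of this RTRL identification loop is given below. The data generation follows (1.77) with u(t) uniform in [0, 1]; the learning rate and epoch count are illustrative and may need tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# training data from the true dynamics (1.77)
T = 100
u = rng.uniform(0.0, 1.0, T)
y = np.zeros(T + 1)
for k in range(1, T):
    y[k + 1] = -0.5 * y[k] - 1.0 * y[k - 1] + 0.5 * u[k]

w = np.zeros(3)                         # [w1, w2, w3]
eta = 0.01
for epoch in range(3000):
    P = np.zeros((3, 2))                # P[i] = [P_wi(t), P_wi(t-1)]
    yh = np.zeros(T + 1)                # network output
    for k in range(1, T):
        yh[k + 1] = w[0] * yh[k] + w[1] * yh[k - 1] + w[2] * u[k]
        # sensitivity recursions, e.g. P_w1(t+1) = y(t) + w1 P_w1(t) + w2 P_w1(t-1)
        z = np.array([yh[k], yh[k - 1], u[k]])
        P_new = z + w[0] * P[:, 0] + w[1] * P[:, 1]
        P[:, 1] = P[:, 0]
        P[:, 0] = P_new
        e = y[k + 1] - yh[k + 1]
        w += eta * e * P[:, 0]          # w_i <- w_i + eta e P_wi(t+1)

print(w)                                # should approach [-0.5, -1.0, 0.5]
```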

1.7 Kohonen Self Organizing Map


Kohonen [Kohonen:01] proposed an unsupervised learning algorithm that
can form clusters for a given data set while preserving topology. A sim-
ple configuration of Kohonen self-organizing feature map is illustrated in
Fig. 1.28. The most prominent feature is the concept of excitatory learning
with a neighborhood around the winning neuron. The size of the neighbor-
hood slowly decreases with each iteration, see for example Figure 1.28(b).
A more detailed description is provided below: The first step is to initial-
ize the synaptic weights in the network. There are three essential processes


Figure 1.27: System Identification using RTRL: (a) error plot (b) evolution
of the weights

[Figure: a two-dimensional lattice of neurons with full connections from the input x (N × 1) to each neuron; a winning neuron and its neighborhood are highlighted.]

Figure 1.28: A two dimensional self organizing feature map. (By updating
all the weight connecting to a neighborhood of the target neurons, it en-
ables the neighboring neuron to become more responsive to the same input
pattern. Consequently, the correlation between neighboring nodes can be
enhanced. Once such a correlation is established, the size of a neighbor-
hood can be decreased gradually, based on the desire of having a stronger
identity of individual nodes.)

involved in the formation of the SOM.

• Competition: For each input pattern, the neurons in the network compute their respective values of a discriminant function. The neuron with the largest value of that function is declared the winner.

• Cooperation: The winning neuron determines the spatial location


of a topological neighborhood of excited neurons, i.e. cooperative
neighboring neurons.

• Synaptic Adaptation: This enables the excited neurons to adjust their


synaptic weights in relation to the input pattern.

Competitive Process

Let m be the dimension of the input (data) space and of the weight vectors. Let a randomly chosen input pattern (vector) be

x = [x_1, x_2, ..., x_m]^T.

Let the synaptic weight vector of neuron j be denoted by

w_j = [w_j1, w_j2, ..., w_jm]^T,  j = 1, 2, ..., l

where l is the total number of neurons in the network.
Finding the best match of the input vector x with the synaptic weight vectors w_j is mathematically equivalent to minimizing the Euclidean distance between the vectors x and w_j. Let i(x) denote the index of the neuron that best matches x:

i(x) = arg min_j ||x − w_j||,  j = 1, 2, ..., l.

Cooperative Process

The winning neuron tends to excite the neurons in its immediate neighborhood more than those farther away. Let h_{j,i} denote the topological neighborhood centered on the winning neuron i, and let d_{j,i} denote the lateral distance between the winning neuron i and the excited neuron j.

• The topological neighborhood h_{j,i} is symmetric about the maximum point defined by d_{j,i} = 0. In other words, it attains its maximum value at the winning neuron i, for which the distance d_{j,i} is zero.

• The amplitude of the topological neighborhood h_{j,i} decreases monotonically with increasing lateral distance d_{j,i}, decaying to zero as d_{j,i} → ∞; this is a necessary condition for convergence.

49
A typical choice of h_{j,i} that satisfies these requirements is the Gaussian function shown in Figure 1.29. The Gaussian neighborhood function is given as


Figure 1.29: Gaussian neighbourhood function

h_{j,i(x)}(n) = exp(−d²_{j,i} / (2σ²(n))),  n = 0, 1, 2, ...

where σ(n) is the width of the neighborhood function. During the training process, the width shrinks as

σ(n) = σ_0 exp(−n/τ_1)

where σ_0 is the initial value of σ, τ_1 is the time constant of the width function, and n is the training step.

One-Dimensional Lattice

For a one-dimensional lattice, d_{j,i} is measured as

d_{j,i} = |j − i|.

[Figure: a one-dimensional lattice of neurons numbered 1 to 8, with neuron 3 the winning neuron.]

Consider the one-dimensional Kohonen lattice above. Let the index of the winning neuron i be 3 and the indices of two neighborhood neurons be 2 and 5. Then

d_{2,3} = |2 − 3| = 1
d_{5,3} = |5 − 3| = 2.

Two-Dimensional Lattice
For a two-dimensional lattice, d²_{j,i} is measured as

d²_{j,i} = ||r_j − r_i||²

where r_i is the position of the winning neuron i and r_j is the position of the neighborhood neuron j.

[Figure: a two-dimensional lattice of neurons with the winning neuron marked.]

Let the index of the winning neuron i be (2, 3) and the index of the neighborhood neuron j be (4, 2). Then

d²_{j,i} = (4 − 2)² + (2 − 3)² = 5.

Adaptive Process
Weights associated with the winning neuron and its neighbors are updated according to the neighborhood function: the winning neuron benefits maximally from the weight update, while the neuron farthest from the winner benefits minimally. The change in weight at each training step is given by

Δw_j(n) = η(n) h_{j,i(x)}(n)(x − w_j(n))
η(n) = η_0 exp(−n/τ_2),  n = 0, 1, 2, ...

where η(n) is the learning-rate parameter, η_0 its initial value, and τ_2 the time constant of the learning rate. The updated weight vector w_j(n+1) at time (n+1) is defined by

w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n)(x − w_j(n)).

This equation moves the weight vector of the winning neuron (and, to a lesser degree, of its neighbors) towards the input vector x.
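The three processes can be put together in a few lines of code. The Python sketch below trains a 1-D lattice of 100 neurons on 2-D data drawn from the unit square (the setting of Example 1.10 below); the schedule constants σ_0, τ_1, η_0 and τ_2 are illustrative choices, not necessarily the values used for the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 100                                   # 1-D lattice of 100 neurons
W = rng.uniform(-0.04, 0.04, (L, 2))      # 2-D weight vector per neuron
sigma0, tau1 = 18.0, 1000.0               # width schedule (illustrative)
eta0, tau2 = 0.9, 2000.0                  # learning-rate schedule (illustrative)

idx = np.arange(L)
for n in range(6000):
    x = rng.uniform(0.0, 1.0, 2)          # sample from the unit square
    i = np.argmin(np.linalg.norm(W - x, axis=1))      # competition: winner i(x)
    sigma = sigma0 * np.exp(-n / tau1)
    eta = eta0 * np.exp(-n / tau2)
    h = np.exp(-((idx - i) ** 2) / (2 * sigma ** 2))  # cooperation: neighborhood
    W += eta * h[:, None] * (x - W)                   # adaptation: Δw_j = η h (x - w_j)

# after training, the rows of W trace a topology-preserving curve over the square
```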

Example 1.10. A 1-D SOM learns a 2-D topology.

Solution
In this simulation, a network of 100 neurons organized in a one-dimensional lattice is trained with a two-dimensional input vector x.

• The input data come randomly from a 2-D topology.

• Each data point is two dimensional, x = [x_1 x_2]^T, where x_1 represents the x coordinate and x_2 the y coordinate.

• The weight w_j associated with each neuron is also two dimensional.

The training is done for 6000 iterations.

m = 2
x = [x_1, x_2]^T
w_j = [w_{j,1}, w_{j,2}]^T,  j = 1, 2, ..., 100

The weights are initialized randomly in the range −0.04 < w_{j,1} < 0.04 and −0.04 < w_{j,2} < 0.04. The input x is uniformly distributed in the region 0 < x_1 < 1 and 0 < x_2 < 1. Figure 1.30 shows the input data space, Figure 1.31 shows the network weights before training, and Figure 1.32 shows that the trained weights preserve the topology of the input space. Next we consider the case where the input data come randomly from an 'L'-shaped input space.


Figure 1.30: Input space for training


Figure 1.31: Initialization of weight for training


Figure 1.32: Weights after the completion of training


Figure 1.33: Input space for training


Figure 1.34: Weights after the completion of training

For the 'L'-shaped case the weights are again initialized randomly in the range −0.04 < w_{j,1} < 0.04 and −0.04 < w_{j,2} < 0.04, and the input x is uniformly distributed over the region (0 < x_1 < 1, 0 < x_2 < 0.3) ∪ (0 < x_1 < 0.3, 0.3 < x_2 < 1). The training is done for 6000 iterations. Figure 1.33 shows the input data space, Figure 1.31 shows the network weights before training, and Figure 1.34 shows that the trained weights preserve the topology of the input space.

Example 1.11. A 2-D SOM learns a 2-D topology.

Solution
In this simulation, a network of 100 neurons organized in a two-dimensional lattice with 10 rows and 10 columns is trained with a two-dimensional input vector x. Thus,

m = 2
x = [x_1, x_2]^T
w_{i,j} = [w_{i,j,1}, w_{i,j,2}]^T,  i = 1, 2, ..., 10,  j = 1, 2, ..., 10.

The weights are initialized randomly in the range −0.08 < w_{i,j,1} < 0.08 and −0.08 < w_{i,j,2} < 0.08. The input x is uniformly distributed in the region 0 < x_1 < 1 and 0 < x_2 < 1. Figure 1.30 shows the input data space, Figure 1.31 shows the network weights before training, and Figure 1.35 shows that the trained weights preserve the topology of the input space. Next


Figure 1.35: Weights after the completion of training

we will show the topology preservation of an L shaped input space with the
same two-dimensional lattice. Figure 1.33 shows the input data space. Fig-
ure 1.31 shows the network weights before training and Figure 1.36 shows
that the weights of the network preserve the topology of the input space.

1.8 System Identification using Neural Networks


In control, the mathematical model of the plant dynamics is derived using
the knowledge of physics, chemistry or biology that governs the plant dy-
namics. The model needs to be close enough to the actual dynamics so that
the controller designed based on the mathematical model can guarantee
stability of the actual system and can provide good performance. However,
it is difficult to derive accurate mathematical models for systems such as
chemical processes, a robot operating in an unstructured environment, a
moving aircraft being subjected to uncertain forces and visually guided ma-
nipulators. In such situations, the designer takes help of the experimental


Figure 1.36: Weights after the completion of training

data collected by observing the input-output response of the plant. The process of deriving models from experimental data is called system identification. With the advent of fast and powerful computing processors, system identification of nonlinear plants has become easier.
System model representations have been discussed in Chapter 1. It is convenient to model systems in a discrete-time representation when using feed-forward networks. Although various forms of discrete-time models are available [narendra:90], non-affine and input-affine forms are two typical representations. The non-affine form is given as

x(k + 1) = f (x(k), u(k)) (1.79)

where u(·) ∈ R^m and x(·) ∈ R^n are the input and state vectors respectively. When the function f in (1.79) is unknown, the problem of system identification can be formally stated as follows.
Suppose that the dynamics of a causal, time-invariant discrete-time plant is described by equation (1.79), where the input u(·) is a uniformly bounded signal and the plant is assumed to be stable. Then a feed-forward network N(·) approximately represents the function f if N(·) predicts an output x̂(k+1), given the input u(k) and the previous system state x(k), such that

∑_{k=1}^{P} ||x(k+1) − x̂(k+1)|| ≤ ε

for a small desired ε > 0, where P is the number of discrete-time instants.
Figure 1.37 shows learning framework for generating a neural system
model of (1.79). One can select either an MLN or an RBF network for neu-
ral representation. When the system is in input affine form, the governing
equation becomes

x(k + 1) = f(x(k)) + G(x(k))u(k). (1.80)

[Figure: the neural network receives x(k) and u(k) and predicts x̂(k+1); the error e(k+1) = x(k+1) − x̂(k+1) drives learning.]

Figure 1.37: Non-affine system model using feed-forward networks

The system in (1.80) can be approximated using two neural networks as


shown in Figure 1.38.

[Figure: one network approximates f̂(x(k)) and another ĝ(x(k)); the latter is multiplied by u(k) and the two are summed to produce x̂_n(k+1), which is compared with x_n(k+1) to form the error e_n(k+1).]

Figure 1.38: Affine system model using feed-forward networks

Example 1.12. The plant is described by the following difference equation:

x(k + 1) = 0.3x(k) + 0.6x(k − 1) + g[u(k)]

where g has the form g(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu). The un-
forced linear system is asymptotically stable. Derive a neural model using
a series-parallel representation

x̂(k + 1) = 0.3x(k) + 0.6x(k − 1) + N[u(k)]

where the unknown function g is approximated by a neural network N.


Solution
The neural network is a three-layered network with two hidden layers of 20 and 10 neurons respectively. The weights of the network N are adjusted using the instantaneous back-propagation algorithm with a learning rate of 0.25. The input is chosen randomly from the uniform interval [−1, 1]. The identification is carried out for 500 time steps and the result is shown in Figure 1.39. The RMS error between the output of the model and the plant is found to be 0.003.

Figure 1.39: Response of the plant and the neural model after training
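A minimal Python sketch of this identification scheme follows. The tanh hidden units and linear output are one plausible realization of the three-layered network, and the training length is illustrative; the learning rate of 0.25 matches the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(u):   # the unknown nonlinearity to be approximated
    return 0.6*np.sin(np.pi*u) + 0.3*np.sin(3*np.pi*u) + 0.1*np.sin(5*np.pi*u)

# two hidden layers of 20 and 10 tanh units, one linear output unit
W1 = rng.normal(scale=0.5, size=(20, 1)); b1 = np.zeros((20, 1))
W2 = rng.normal(scale=0.5, size=(10, 20)); b2 = np.zeros((10, 1))
W3 = rng.normal(scale=0.5, size=(1, 10));  b3 = np.zeros((1, 1))
eta = 0.25

for step in range(50000):
    u = rng.uniform(-1.0, 1.0)
    a0 = np.array([[u]])
    a1 = np.tanh(W1 @ a0 + b1)
    a2 = np.tanh(W2 @ a1 + b2)
    yhat = W3 @ a2 + b3                 # N[u(k)]
    e = g(u) - yhat[0, 0]
    # instantaneous back-propagation
    d3 = np.array([[-e]])
    d2 = (W3.T @ d3) * (1 - a2 ** 2)
    d1 = (W2.T @ d2) * (1 - a1 ** 2)
    W3 -= eta * d3 @ a2.T; b3 -= eta * d3
    W2 -= eta * d2 @ a1.T; b2 -= eta * d2
    W1 -= eta * d1 @ a0.T; b1 -= eta * d1

# series-parallel model: x_hat(k+1) = 0.3 x(k) + 0.6 x(k-1) + N[u(k)]
```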

Example 1.13. Consider a MIMO plant governed by the following difference equations:

x_1(k+1) = x_1(k) / (1 + x_2²(k)) + u_1(k)
x_2(k+1) = x_1(k) x_2(k) / (1 + x_2²(k)) + u_2(k)

Derive a neural model using the following series-parallel representation:

x̂_1(k+1) = N¹[x_1(k), x_2(k)] + u_1(k)
x̂_2(k+1) = N²[x_1(k), x_2(k)] + u_2(k)

Solution
The identification is carried out for random inputs u_1 and u_2 uniformly distributed in the interval [−1, 1]. N¹ and N² are two three-layered networks with 15 and 10 hidden units respectively. The learning rate is taken as 0.1. Figure 1.40 shows the identification results.
Example 1.14. Lorentz Attractor Problem
Solution


Figure 1.40: Responses of the plant and the identification model

The Lorentz Attractor refers to the solutions of the system of differential equations proposed by Lorentz for modeling the motion of convective fluids. The differential equations are as follows:

dx/dt = σ(y − x)
dy/dt = x(r − z) − y          (1.81)
dz/dt = xy − bz
The parameters of this system of equations are σ , r and b which are fixed

Figure 1.41: Three dimensional plot of the Lorentz Attractor

Figure 1.42: x − y projection of the Lorentz Attractor

for a specific problem. For our study the parameter values are taken as σ = 16, r = 45.92 and b = 4. The solutions of this system cannot be found by analytic integration; they are therefore computed numerically, with good accuracy, using a fourth-order Runge-Kutta ODE solver.
The solution of these differential equations traces out a path in the three-dimensional phase space. The three-dimensional plot of the Lorentz Attractor obtained with the Runge-Kutta method is shown in Figure 1.41, and its projection on the x-y plane is shown in Figure 1.42. A plot of the x-signal of the attractor versus time is shown in Figure 1.43. All three plots are obtained from a set of 30000 data points computed with the fourth-order Runge-Kutta solver.
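A short Python sketch of this data generation under the stated parameters is given below; the step size and initial condition are illustrative assumptions:

```python
import numpy as np

sigma, r, b = 16.0, 45.92, 4.0

def lorentz(s):
    x, y, z = s
    return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

def rk4_step(s, dt):
    # classical fourth-order Runge-Kutta step
    k1 = lorentz(s)
    k2 = lorentz(s + 0.5 * dt * k1)
    k3 = lorentz(s + 0.5 * dt * k2)
    k4 = lorentz(s + dt * k3)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

dt, N = 0.001, 30000
traj = np.empty((N, 3))
s = np.array([1.0, 1.0, 1.0])        # initial condition (illustrative)
for i in range(N):
    s = rk4_step(s, dt)
    traj[i] = s                      # 30000 points of the attractor
```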

In the following sections, we shall model this attractor using modified forms of the Radial Basis Function Network. The network is identical to the usual RBF network described earlier in this chapter except for the learning rule used for the centers. Center learning is modified to accommodate a clustering algorithm inspired by k-means: instead of the usual "winner takes all" philosophy, a "winner takes most" paradigm is used. A further change from the usual clustering algorithm is that the clustering is done in a higher-dimensional space. Using the input and output clubbed together for clustering is not a new idea; it has been researched previously and its benefits for RBF networks were reported in [koivo]. The input space and the output space are clubbed together, the learning algorithm is applied in this higher-dimensional space, and the projection of the learned vectors onto the input space is then taken as the centers. The locations of the centers are therefore influenced not only by the input vectors but also by the output sample deviations. Learning proceeds in two stages. In the first stage, only center learning is done: random centers are placed in the clubbed input-output space, and a clustering algorithm adjusts their positions so that the input-output vectors are distributed evenly among the centers of the neurons. The projection of the clubbed input-output vectors onto the input space then provides the network with the required centers. In the second stage, a gradient descent algorithm is applied to learn the weight vectors.
To test the performance of the Radial Basis Function Network, a 3-input, 3-output network with 700 centers was used to first train on and then predict the Lorentz Attractor. The values of σ (the neighborhood width) and η (the learning rate) were taken as 0.3 and 0.25 respectively. The training set consisted of 10,000 data points which

Figure 1.43: Plot of the x-signal of the actual Lorentz Attractor with time


Figure 1.44: Plot of the x-signal coming from the trained RBFN.

were presented to the network in random order. After training, the network was used to predict 30,000 data points of the Lorentz Attractor. The resulting x-signal is shown in Figure 1.44; comparing it with Figure 1.43 gives a qualitative judgment of the effectiveness of the network. The mean square error in the x-signal estimation is 0.0791083, in the y-signal 0.062171, and in the z-signal 0.0588676, while the total error is 0.116514. This network architecture has captured the dynamics of the Lorentz attractor to some extent, but it certainly leaves scope for improvement.
This network demonstrates the power of the general class of Radial Basis Function Networks. The RBFN performance can be improved further if the clus-

Figure 1.45: 3-D plot of the predicted Lorentz System using RBFN with higher order clustering.

Figure 1.46: x-signal plot of the predicted Lorentz System using RBFN with
higher order clustering.

tering is done in a higher-dimensional space, exactly as described above: the input and output spaces are clubbed together, the centers are learned in the clubbed space and projected back onto the input space, and the weights are then learned by gradient descent. For this second network the number of centers was 500, and the values of σ (the neighborhood width) and η (the learning rate) were taken as 1.5 and 0.30 respectively. Again a training set of 10,000 data points was presented to the network in random order, and after training the network was used to predict 30,000 data points of the Lorentz Attractor from the same initial point. Figure 1.45 shows the three-dimensional plot of the predicted Lorentz attractor after training, and Figure 1.46 shows the behavior of the x-signal obtained from the RBFN using higher-order clustering. The mean square errors (MSE) for this network were: MSE in x-signal = 0.0632505; MSE in y-signal = 0.0818714; MSE in z-signal = 0.043263; total MSE = 0.0112162.

1.9 SOM based Identification


Even though Kohonen Self-Organizing Maps, as described above, have been used extensively for unsupervised learning and for capturing patterns in data, their use for function approximation is relatively new. The approach of using the Kohonen network for supervised learning and function approximation is described in detail in [behera]. Each neuron in the Kohonen lattice has a weight vector of dimension three (because the inputs are from R³). Also associated with each neuron is an output vector denoting the output of that neuron. The Kohonen learning algorithm is applied in the higher-dimensional, clubbed input-output space; at the end of learning, it will have captured the topology of this space.
Thus, we can define an output vector for each neuron in the final Kohonen lattice [behera]. The projection of the clubbed vectors onto the input space gives the weight vector, while the projection onto the output space gives the associated output vector of each neuron. We use the following notation: let w_ij denote the weight vector of the (i, j)th neuron, θ_ij the output vector of this neuron, and A_ij a 3 × 3 matrix which approximates the Jacobian of the function in the local region of the neuron. The training of A_ij is done after the Kohonen learning in the input-output space is finished. From the training data set, we find all those data points for which this neuron is the winner (i.e. it is the closest). We randomly choose one
point (u_inp, u_op) from this set of points and update A_ij in the following manner:

A_ij^new = A_ij^old + η(t) g(u_inp, w_ij) ΔA_ij          (1.82)

where

ΔA_ij = ||Δv||^{−2} (I − A_ij Δv Δv^T)
Δv = u_op − θ_ij
η = learning rate
g(u_inp, w_ij) = exp(−||u_inp − w_ij||² / (2σ²))
I = 3 × 3 identity matrix.

Therefore, for any given input u, the corresponding output u_ij of each neuron is given by

u_ij = θ_ij + A_ij (u − w_ij).          (1.83)

The final output of the whole network is a weighted average over the outputs of the individual neurons, where the weight depends on the lattice distance of the neuron from the "winner" neuron (i_0, j_0) corresponding to this input. Thus we get

u_out = ∑_{i,j} h(i, j, i_0, j_0) u_ij / ∑_{i,j} h(i, j, i_0, j_0)          (1.84)

where

h(i, j, i_0, j_0) = exp(−||(i, j) − (i_0, j_0)||² / (2σ(t)²)).

In this particular problem σ(t) and η(t) both start high and are then decreased to very small positive values according to equation (1.85):

ε(t) = ε_initial (ε_final / ε_initial)^{t/t_max}          (1.85)

where ε ∈ {σ, η}.

Two Dimensional Kohonen Lattice

The use of the hybrid Kohonen SOM structure as a function approximation technique was described in Section 1.9. We will now use this approach to
model the Lorentz Attractor as given in Example 1.14. A two-dimensional

Figure 1.47: Plot of the predicted Lorentz Attractor on the X-Y plane using
2-dimensional Kohonen Lattice.

Figure 1.48: Plot of the x-signal of the predicted Lorentz Attractor using
2-dimensional Kohonen Lattice.

network is considered for this purpose, consisting of a 70 × 70 lattice; the values of σ_initial, σ_final, η_initial and η_final are taken as 40, 0.01, 0.90 and 0.01 respectively. The network is trained over a set of 10000 data points and is then used to predict the Lorentz Attractor for 30000 data points. The results obtained for this network are shown in the accompanying figures. The plots show the topology-preserving nature of the Kohonen lattice very clearly, in spite of the fact that the learning was done in the higher-dimensional space after clubbing the input-output vectors. Figure 1.49 shows the positions of the weight vectors of the neurons of the two-dimensional Kohonen lattice in the input space after learning. The topology preservation is clearly visible, which indicates that the network has efficiently learned the data and has effectively captured the dynamics of the time series.

Figure 1.49: Plot of the weight vectors in the input space after learning.

Figure 1.50: Plot of the predicted Lorentz Attractor on the X-Y plane using
a 3-dimensional Kohonen Lattice.

Three dimensional Kohonen Lattice

This network is very similar to the one described above in Section 1.9; the only difference is the dimension of the Kohonen lattice. Here a 3-dimensional lattice is used instead of a 2-dimensional one. The output vectors are again learned using the higher-order clustering algorithm, i.e., the input and output vectors are clubbed and the Kohonen learning rule is applied to the clubbed vectors as described earlier. The network consists of a 15 × 12 × 12 lattice, and the values of σ_initial, σ_final, η_initial and η_final are taken as 20, 0.01, 0.90 and 0.005 respectively. Once the training is over, the projections of these vectors onto the input space become the weight vectors of the neurons, while the projections onto the output space form the outputs of the respective neurons. The network is trained over a set of 10000 data points and is then used to predict the Lorentz Attractor for 30000 data points. Intuitively, a 3-dimensional Kohonen lattice should perform better in capturing the dynamics of the Lorentz Attractor, and this is indeed the case, as demonstrated by the results obtained.

Figure 1.51: Plot of the predicted Lorentz Attractor on the X plane using a
3-dimensional Kohonen Lattice.

Figure 1.52: Plot of the weight vectors in the X-Y plane using 3-dimensional
Kohonen Lattice

The plots of the obtained results are shown here in Figures 1.50, 1.51 and
1.52. Readers should be able to appreciate the use of SOM network in sys-
tem identification as these results show that even chaotic attractors can be
modelled using such networks.

1.9.1 KSOM based mapping: An Example


We have learned how a Kohonen self-organizing map works. In this section, we will learn how this network can be used for learning an arbitrary map f : x → y. That is, given data (x, y), we can build a neural architecture around KSOM that learns the unknown map f(·). The KSOM network that learns this map is shown in Figure 1.53.
Let’s assume that the following nonlinear map is given:

y = f (x), x ∈ Rn , y ∈ R m . (1.86)

We will express this nonlinear function as an aggregation of linear functions

Figure 1.53: KSOM network for system identification.

using a first-order Taylor series expansion. Given any input vector x_0,

y_0 = f(x_0).          (1.87)

Using the first-order Taylor series expansion, the output y can be expressed linearly around x_0 as

y = y_0 + (∂f/∂x)|_{x=x_0} (x − x_0).          (1.88)
Let us consider a Kohonen lattice where each neuron is associated with the following linear model:

ŷ_γ = y_γ + A_γ (x − w_γ)          (1.89)

where, given x, ŷ_γ is the linear response of the γth neuron. Each neuron is associated with three parameters: w_γ, the natural weight vector; y_γ, which should converge to f(w_γ); and A_γ, which is the equivalent of (∂f/∂x)|_{x=w_γ}.
The linear response of each neuron given x carries a weight h_γ, where h_γ is the neighborhood function with respect to the winning neuron. Thus the nonlinear map y = f(x) can be approximated as

y = ∑_γ h_γ ŷ_γ / ∑_γ h_γ          (1.90)

where h_γ = e^{−d_γ²/(2σ²)} and d_γ is the lattice distance between the winning neuron i and the γth neuron.

The final expression for the network response can therefore be given as

y = ∑_γ h_γ (y_γ + A_γ (x − w_γ)) / ∑_γ h_γ.          (1.91)

As shown in Figure 1.53, the network produces a collective response y when excited by the input pattern x, with each neuron computing its own response linearly. The parameters w_γ, A_γ and y_γ associated with each neuron are unknown and are randomly initialized with very small values. We now derive the update laws for these parameters. Since w_γ is the natural weight vector, its update follows the usual Kohonen weight update algorithm:

w_γ ← w_γ + η h_γ (x − w_γ).          (1.92)

Let the cost function be E = (1/2) ỹ^T ỹ, with ỹ = y_d − y, where y_d is the desired response for input x and y is the network response. The update law for y_γ can be derived using gradient descent:

∂E/∂y_γ = −ỹ^T ∂y/∂y_γ = −ỹ^T (h_γ / ∑ h_γ).          (1.93)

Thus the update law for y_γ becomes

y_γ ← y_γ + η (h_γ / ∑ h_γ) ỹ.          (1.94)

For the update law of A_γ, the gradient term is derived as

∂E/∂A_γ = −ỹ^T ∂y/∂A_γ = −(h_γ / ∑ h_γ) ỹ (x − w_γ)^T.          (1.95)

Thus the update law becomes

A_γ ← A_γ + η (h_γ / ∑ h_γ) ỹ (x − w_γ)^T.          (1.96)

It is important to note that KSOM-based system identification makes use of both unsupervised and supervised learning; this is a type of hybrid learning.
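The three update laws (1.92), (1.94) and (1.96) fit naturally into a short training loop. The Python sketch below uses the 5 × 5 lattice of Example 1.15 below; the learning rate, the fixed neighborhood width, and the reading of the target map (equations (1.97)-(1.98)) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

rows = cols = 5                           # 5 x 5 lattice as in Example 1.15
G = rows * cols
coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
W = rng.uniform(0, 1, (G, 2))             # w_gamma  (input-space weights)
Y = rng.uniform(0, 1, (G, 2))             # y_gamma  (stored outputs)
A = rng.uniform(0, 1, (G, 2, 2))          # A_gamma  (local Jacobians)
eta, sig = 0.1, 1.0                       # illustrative constants

def forward(x):
    win = np.argmin(np.linalg.norm(W - x, axis=1))          # winning neuron
    d2 = np.sum((coords - coords[win]) ** 2, axis=1)        # lattice distances
    h = np.exp(-d2 / (2 * sig ** 2))                        # h_gamma
    resp = Y + np.einsum('gmn,gn->gm', A, x - W)            # local linear models
    return (h[:, None] * resp).sum(0) / h.sum(), h          # eq. (1.91)

def train(x, yd):
    global W, Y, A
    y, h = forward(x)
    ytil = yd - y
    hn = h / h.sum()
    W += eta * h[:, None] * (x - W)                                          # (1.92)
    Y += eta * hn[:, None] * ytil[None, :]                                   # (1.94)
    A += eta * hn[:, None, None] * ytil[None, :, None] * (x - W)[:, None, :] # (1.96)

def f(x):   # assumed reading of the map (1.97)-(1.98)
    return np.array([np.exp(x[0] ** 2 + x[1] ** 2), np.exp(x[0]) + x[1] ** 2])

for epoch in range(200):
    for _ in range(500):
        x = rng.uniform(0, 1, 2)
        train(x, f(x))
```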

Example 1.15. Consider the following map:

y_1 = e^{x_1² + x_2²}          (1.97)
y_2 = e^{x_1} + x_2²          (1.98)

Generate input data x = [x_1 x_2]^T uniformly distributed in [0, 1] and compute the corresponding output y = [y_1 y_2]^T. Take a 2-D lattice of size 5 × 5. Update the weights w_γ, y_γ and A_γ as given in equations (1.92), (1.94) and (1.96) respectively.
Solution.
The KSOM network has 25 sets of parameters; the set associated with each neuron is (w_γ, y_γ, A_γ). These parameters are initialized uniformly at random in [0, 1]. The network is excited by x, generated uniformly at random in [0, 1]; using the corresponding desired response y_d and the network response y, the weights are updated.
Five hundred training data sets are generated from this map and used to train the network over 200 epochs. The error convergence over the epochs is shown in Figure 1.54(a). During testing, the input data are generated as x_1 = 0.5 + 0.5 cos(kπ/40) and x_2 = 0.5 + 0.5 sin(kπ/40). The test results, y_d versus y, are plotted in Figures 1.54(b) and (c) respectively; the actual network response follows the desired response very accurately, with small rms tracking errors for both y_1 and y_2. These results confirm that the KSOM network can be used for learning an unknown map.

Figure 1.54: (a) Error convergence during training; (b) y_1 versus y_1^d during testing; (c) y_2 versus y_2^d during testing.

1.10 Summary
This chapter on neural networks is self-sufficient for readers to grasp the intelligent control concepts using neural networks covered in this book. Network architectures and associated learning algorithms for multi-layered networks, radial basis function networks, recurrent networks and SOM networks have been presented in a tutorial manner. The use of these networks for function approximation has been described through many illustrative examples. The section on system identification allows readers to understand function approximation for nonlinear systems, including a chaotic system. The section on the adaptive learning rate shows that control-theoretic concepts can be applied to network learning; control theory and neural learning research thus complement each other.

1.11 Exercises
1. Find the global minima of the following:

E(x_1, x_2) = x_1⁴ + 2x_1²x_2² + x_2⁴
E(x) = 3x − 3x⁴ − 3x³ + 13x²

2. Derive the back-propagation algorithm for a MLN with three layers.


Use generalized delta rule.
3. Solve the symmetry problem, i.e. design a MLN whose output will be true (+1) if the input pattern is symmetric about the center and false (−1) otherwise. Use a two-layered network with six inputs and 2 hidden units.
4. Show that the network shown in Figure 1.55 solves the XOR problem
by constructing decision regions and a truth table for the network.

71
[Figure: a network with inputs x_1, x_2, x_3.]

Figure 1.55: Multi-layered Networks

5. Can back-propagation learning with a sigmoidal nonlinearity be used to achieve the following one-to-one mappings?

(a) f(x) = 1/x, 1 ≤ x ≤ 100
(b) f(x) = log₁₀ x, 1 ≤ x ≤ 10
(c) f(x) = sin x, 1 ≤ x ≤ π/2

Set up two sets of data, one for training and another for testing. Use
the training data set to compute the synaptic weights of the network.
Evaluate the computation accuracy of the network by using the test
data. Use a single hidden layer but variable number of hidden neu-
rons. Note the effect of the change in size of the hidden layer on the
network performance.
6. The MLN in the figure 1.56, uses a single hidden layer with sigmoidal
activation function.
[Figure: a network with inputs x_1, ..., x_4 feeding a hidden unit through the weights w_21, ..., w_24.]

Figure 1.56: Multi-layered Networks

(a) Find out the response of the network using forward pass.
(b) Find out the update law for the weights w21 and w24 .
7. Construct an RBFN with four hidden units, with each radial-basis function center determined by one piece of input data, to obtain an exact solution to the XOR problem. The four possible input patterns are (0,0), (0,1), (1,1), (1,0), which represent the cyclically ordered corners of a square.

(a) Construct the interpolation matrix Φ for the resulting RBFN and hence compute the inverse matrix Φ⁻¹.
(b) Calculate the linear weights of the output layer of the network.

8. Design an RBFN for the following set of training examples (x → y):

x_1 = [−7 +2]^T, y_1 = +1
x_2 = [−4 −2]^T, y_2 = −1

Use the radial basis function ϕ(r) = (r² + 9)^{0.5}.
{Hint: Fix the number of centers and the corresponding centroids. No center updating is required; only find the value of the weight vector.}

9. For an RBFN, find the update rule for the centers using the gradient descent technique when the basis function is the inverse quadratic, i.e.,

Φ(r) = 1 / (r² + c²)^{1/2}.

10. Realize an XOR function using radial basis function network whose
basis function is given as Ψ(r) = r2 log(r).

11. An RBFN has 5 radial centers. It has one input u and one output y.
The weight vector is denoted by w. The center of each unit is denoted
as ci . The radial basis function is Gaussian.

(a) Draw the architecture of the network.


(b) Write the response y of the network in terms of u, c and w.
(c) Derive the expression dy/du for the network.

12. Generate 5 sets of training data using the following model:

y(k+1) = y(k) / (1 + y²(k)) + u³(k)

The input sequence is given as {0.5, 0.2, 0.05, 0.8, 0.35}. The radial function is Gaussian. Draw the architecture of the RBFN network with 5 radial centers. What are the radial centers? Find the weights using the pseudo-inverse technique, then find the weights by the recursive gradient algorithm, and compare the results.

13. Train an RBFN to learn the quadratic function y = u². The network has 50 centers, randomly chosen between 0 and 1. The input range is [0, 1]. The radial function is Gaussian, ϕ(r) = e^{−r²}.

(a) Find the response of the network when the input takes a value from the set [0.2, 0.5, 0.6, 0.9]. Does the result indicate that the network has learned the above mapping?
(b) One way to verify proper training is to test both the forward mapping (response to a given input) and the inverse mapping (predicting the input for a given response). For inverse mapping, the iterative algorithm is

u(t+1) = u(t) − η ∂E(t)/∂u(t)

i. Find the explicit expression of ∂E/∂u for the given network.
ii. Using the iterative inversion algorithm, predict the input when the desired response takes a value from the set [0.09, 0.25, 0.36, 0.64]. Select the initial value randomly from [0, 1].
iii. Repeat the above steps while updating both centers and weights.

14. Figure 1.57 shows recurrent network consisting of three computation


nodes with self-feedback applied to all of them. Construct a multi-
layer feed-forward network by unfolding the temporal behavior of this
network.

[Figure: three computation nodes x_1, x_2, x_3, each with self-feedback.]

Figure 1.57: A recurrent network

15. Generate a set of random inputs u(k), k = 1, 2, ..., 200. Obtain the training data set defined by the relation

y(k+1) = u(k) / (1 + y²(k))

given y(0) = 0. The network is shown in Figure 1.58. Unfold the network in time up to t = 200.

16. Unfold the network shown in Figure 1.59 in time.

[Figure: a single-neuron recurrent network with input u(k) and weight w producing y(k + 1).]

Figure 1.58:

[Figure: a recurrent network with input u(t), weights w_1, ..., w_4 and output y(t + 1).]

Figure 1.59:

17. Consider the dynamics of a single-link manipulator

m l² θ̈ + m g l cos θ = τ

where θ is the manipulator angle, θ̇ is the angular velocity and τ is the joint torque. Taking the system parameters as m = 1 kg, l = 1 m and g = 10 m/s², discretize the state-space model of the system using the Euler method with a sampling interval T of 0.001 s. Generate 3000 pairs of input-output data and identify the system using a

• Multi-layered network
• Radial Basis Function Network

Since the system is open-loop unstable, a PD controller with random


sinusoidal trajectories as the desired trajectories can be used for data
generation. Various dither signals like noise, impulse, step can be
added to the PD controller output so that the generated data span al-
most the entire workspace of the system.

Generate 1000 pairs of new test data and verify the identification re-
sult.

18. The state-space model of a magnetic ball suspension system is given by

dx_1(t)/dt = x_2(t)
dx_2(t)/dt = g − x_3²(t) / (M x_1(t))
dx_3(t)/dt = −(R/L) x_3(t) + (1/L) v(t)

where x_1(t) = y(t) is the ball position in meters, x_2(t) = ẏ(t) is the ball velocity, x_3(t) = i(t) is the winding current and v(t) is the input voltage. The system parameters are: M = 0.1 kg, the mass of the ball; g = 9.81 m/s², the gravitational acceleration; R = 50 Ω, the winding resistance; L = 0.5 H, the winding inductance. The position
of the ball is measured by a position sensor. Express the system in
a discrete dynamical form and identify the dynamics of the system
using

• Multi-layered network
• Radial Basis Function Network

You should generate separate data sets for identification and model
validation. Compare the results of identification.

19. The model of a two-phase permanent magnet stepper motor is given


by

θ̇ = ω
J ω̇ = −Km ia sin(N θ ) + Km ib cos(N θ ) − Bω − TL
La i̇a = −Ra ia + Kb ω sin(N θ ) + va
Lb i̇b = −Rb ib − Kb ω cos(N θ ) + vb

where θ is the angular position, ω is the angular rate and ia , ib are


the currents for phases A and B, J is the moment of inertia, Km is the
motor torque constant, N is the number of teeth on the rotor, B is the
viscous friction coefficient, La , Lb and Ra , Rb are the inductances and
resistances of phases A and B, Kb is the back emf constant and TL is
the load torque. Design a control scheme for voltages va and vb such
that θ follows a desired trajectory r(t). Model the load torque for a
mechanical link of length l and mass m as mgl sin(θ ) where m is the
mass of the shaft.
The system parameters are tabulated below:

J = 0.0733 kg·m²,  B = 0.002 N·m·s/rad
m = 0.4014 kg,  N = 50,  l = 1 m
L_a = L_b = 0.7 H,  R_a = 0.9 Ω,  R_b = 1.2 Ω
K_m = 0.25 N·m/A,  K_b = 0.025 V·s/rad,  g = 9.81 m/s²
Find out the discrete time representation of the system and identify
the dynamics using a radial basis function network.

20. Consider the following discrete-time representation of a surge tank system:

h(k+1) = h(k) + T [ −√(2 g h(k)) / A(h(k)) + u(k) / A(h(k)) ]

where u(k) is the input flow, h(k) is the liquid level of the tank (the output of the system), and A(h(k)) = √(a h²(k) + b) is the cross-sectional area of the tank at the kth instant. The input u(k) can be positive or negative, i.e., it can pull liquid out of the tank as well as put it in. The parameters a and b are given as a = 1 and b = 2. Taking T = 0.1, identify the system dynamics using a recurrent network. Employ both BPTT and RTRL for learning the network.

21. The discrete dynamical representation of a Henon map is given as follows:

x_1(k+1) = −1.4 x_1²(k) + x_2(k) + u(k) + 1
x_2(k+1) = 0.3 x_1(k)

Identify the above dynamics using a Self Organizing Map.
