Ffnets Note PDF
1 Introduction

The development of layered feed-forward networks began in the late 1950's with Rosenblatt's perceptron and Widrow's adaptive linear element (ADALINE). Both the perceptron and ADALINE are single layer networks and are often referred to as single layer perceptrons. The limitations of the single layer network have led to the development of multi-layer feed-forward networks with one or more hidden layers, called multi-layer perceptron (MLP) networks. MLP networks overcome many of the limitations of single layer perceptrons, and can be trained using the backpropagation algorithm. The backpropagation technique was invented independently several times.

In 1974, Werbos developed a backpropagation training algorithm. However, Werbos' work remained almost unknown in the scientific community, and in 1985, Parker rediscovered the technique. Soon after Parker published his findings, Rumelhart, Hinton and Williams also rediscovered the technique. It is the efforts of Rumelhart and the other members of the Parallel Distributed Processing (PDP) group that made the backpropagation technique a mainstay of neurocomputing.

To date, backpropagation networks are the most popular neural network model and have attracted the most research interest among all the existing models.

2 The Perceptron

The single layer perceptron was first devised by Rosenblatt in the late 1950's and early 1960's. The basic model of a perceptron capable of classifying a pattern into one of two classes is shown in Fig. 1.

[Figure 1 shows inputs x1, x2, ..., xN with weights w1, w2, ..., wN feeding a summing unit followed by a hard limiter.]

Figure 1 A basic perceptron model

The machine consists of an array S of sensory units which are randomly connected to a second array A of associative units. Each of these units produces an output only if enough of the sensory units which are connected to it are activated; that is, the output signals of the associative units are binary.

The sensory units can be viewed as the means by which the machine receives stimuli from its external environment. The outputs of the associative units are the input to the perceptron.

The response of the machine is proportional to the weighted sum of the outputs of the associative units; i.e., if x_i denotes the output signal of the ith associative unit and w_i the corresponding weight, the response is given by

    r = ∑_{i=1}^{N} w_i x_i        (1)

and this response signal is passed through a hard limiting non-linearity to produce the output of the machine:

    y = +1 if r ≥ 0
        −1 if r < 0

[Figure 2 shows the (x1, x2) input plane divided by the boundary line w1 x1 + w2 x2 = 0, with class 1 above the line and class 2 below it.]
Figure 2. Decision regions formed by the perceptron network
An effective technique for analysing the behaviour of the perceptron network shown in Fig. 1 is to plot a map of the decision regions created in the multidimensional space spanned by the input variables. The perceptron network of Fig. 1 forms decision regions separated by a hyperplane defined by

    ∑_{i=1}^{N} w_i x_i = 0        (2)

As can be seen from (2), the decision boundary is determined by the connection weights of the network. In the two-input example of Fig. 2, inputs above the boundary line lead to a class 1 (r ≥ 0) response, and inputs below it lead to a class 2 (r < 0) response.

2.1 The Perceptron Training Algorithm

The training algorithm for the perceptron network of Fig. 1 is a simple scheme for the iterative determination of the weight vector W. This scheme, known as the perceptron convergence procedure, proceeds as follows. The initial connection weights are set to small random non-zero values. A new input pattern X(n) is then applied, the output y(n) is computed as in Fig. 1, and the weights are adapted according to

    W(n+1) = W(n) + η (d(n) − y(n)) X(n)

where n is the iteration index, d(n) is the desired response for the nth pattern, and η is a positive gain factor less than 1.

The perceptron convergence procedure does not adapt the weights if the output decision is correct. If the output decision disagrees with the binary desired response d(n), however, adaptation is effected by adding the weighted input vector to the weight vector when the error d(n) − y(n) is positive, or subtracting the weighted input vector from the weight vector when the error is negative.
[Figure 3 shows samples from two classes in the input plane, together with the decision boundaries after the weights had been adapted at iterations n = 0, 2, 4 and 80.]

Figure 3 An example of the perceptron convergence process

Samples from class 1 are represented by circles in the figure, and samples from class 2 are represented by crosses. Samples from class 1 and class 2 were presented alternately. The four lines show the four decision boundaries after the weights had been adapted following errors on iterations 0, 2, 4, and 80. In this example it can be seen that the classes were well separated after only four iterations.

3 The ADALINE and Widrow-Hoff Algorithm

The adaptive linear element (ADALINE) is a simple type of processing element that has a real vector X as its input and a real number y as its output (Fig. 4), and uses the Widrow-Hoff algorithm, described below, to adapt its weights.
[Figure 4 shows inputs x0, x1, x2, x3, ..., with weights w0, w1, w2, w3, ..., feeding a summing unit that produces the output y.]

The input to the ADALINE is X = (x0, x1, ..., xN), where x0 is the bias input, set to a constant value (usually x0 = 1). The output of the ADALINE is the inner product of the weight vector W and the input vector X:

    y = x0 w0 + x1 w1 + ... + xN wN

The hyperplane that forms the ADALINE's decision boundary is given by

    x0 w0 + x1 w1 + ... + xN wN = C

[Figure 5 shows the two-input case: the line w0 x0 + w1 x1 + w2 x2 = C in the input space.]

Figure 5 ADALINE decision boundary in the input space. For all input vectors X that fall in the area above the decision hyperplane, the ADALINE output y > C. For all input vectors X that fall in the area below the hyperplane, the ADALINE output y < C.

For each input vector X(n) to the ADALINE, there exists a corresponding target output (desired output) d(n). The cost function of the ADALINE is defined as

    E = (1/2) ∑_{n=1}^{P} (d(n) − y(n))²

where P is the number of training vectors, and y(n) is the actual output of the ADALINE for the nth training vector X(n).

In 1959, Bernard Widrow, along with his student Marcian E. Hoff, developed an algorithm for finding the weight vector W* that minimises this cost function. This algorithm is the well known Widrow-Hoff algorithm, also known as the LMS algorithm and the delta rule.

The method for finding W* is to start from an initial value of W and then 'slide down' the ADALINE cost function surface until the bottom of the surface is reached. Since the cost function is a quadratic function of the weights, the surface is convex and has a unique (global) minimum.
The basic principle of the Widrow-Hoff algorithm is a gradient descent technique and has the form of

    w_i(n+1) = w_i(n) − η ∂E/∂w_i        (6)

From the definition of the cost function we have

    ∂E/∂w_i = (1/2) ∑_{n=1}^{P} 2 (d(n) − y(n)) (−∂y(n)/∂w_i)
            = ∑_{n=1}^{P} (d(n) − y(n)) (−x_i(n))
            = −∑_{n=1}^{P} δ(n) x_i(n)        (7)

where δ(n) = d(n) − y(n).

Instead of computing the true gradient using equation (7), the Widrow-Hoff algorithm uses the instantaneous gradient, which is readily available from a single input data sample, and the Widrow-Hoff training algorithm is given by

    w_i(n+1) = w_i(n) + η δ(n) x_i(n)

The training constant η determines the stability and convergence rate, and is usually chosen by trial and error. If η is too large, the weight vector will not converge; if η is too small, the rate of convergence will be slow.

4 Multi-layer Feed-forward Networks and the Backpropagation Training Algorithm

In the previous two sections, networks with only input and output units were described. These networks have proved useful in a wide variety of applications. The essential character of such networks is that they map similar input patterns to similar output patterns. This is why such networks can do a relatively good job in dealing with patterns that have never been presented to them. However, the constraint that similar input patterns lead to similar outputs is also a limitation of such networks. For many practical problems, very similar input patterns may have very different output requirements. In such cases, the networks described in sections 2 and 3 may not be able to perform the necessary mappings.

Minsky and Papert pointed out that such networks cannot even solve the exclusive-or (XOR) problem illustrated in the table below.

    x1   x2   output
    0    0    0
    0    1    1
    1    0    1
    1    1    0
[Figure 6 shows the four XOR input points in the (x1, x2) plane: (0,1) and (1,0) require y = 1, while (0,0) and (1,1) require y = 0, so no single straight decision boundary can separate the two sets.]

Figure 6 A single layer feed-forward network is incapable of solving the XOR problem

To see this in a more straightforward way, we recall from the above sections that single layer networks form a hyperplane that separates the N-dimensional Euclidean space of input vectors. In the case of the XOR problem, the input vectors are two-dimensional and the hyperplane (which is determined by the weights of the network) is a straight line. As can be seen from Fig. 6, this line should divide the space such that the points (0,0) and (1,1) lie on one side and the points (0,1) and (1,0) lie on the other side of the line. This is clearly impossible for a single layer network.

To overcome the limitations of single layer networks, multi-layer feed-forward networks can be used, which not only have input and output units, but also have hidden units that are neither input nor output units. A three layer feed-forward network with one hidden layer is shown in Fig. 7.

[Figure 7 shows inputs x1, x2, ..., xN fully connected to a layer of hidden units, which are in turn fully connected to output units y1, y2, ..., yM.]

Figure 7 A three layer feed-forward network

Multi-layer networks overcome many of the limitations of single layer networks, but were generally not used in the past (before the mid 1980s) because an effective training algorithm was not available. With the publication of the backpropagation training algorithm by Rumelhart, Hinton and Williams in the mid-1980's, multi-layer feed-forward networks, sometimes called multi-layer perceptron (MLP) networks, have become a mainstay of neural network research. In September 1992, Professor B. Widrow from Stanford University told the delegates of the 1992 European Neural Network Conference that "three quarters of the neural network researchers in the USA work on backpropagation networks".

The capabilities of multi-layer networks stem from the non-linearities used within the units. Each neuron in the network receives inputs from other neurons in the network, or receives inputs from the outside world. The outputs of the neurons are connected to other neurons or to the outside world. Each input is connected to the neurons by a weight. The neuron calculates the weighted sum of the inputs (called the activation), which is passed through a non-linear transfer function to produce the actual output for the neuron. The most popular non-linear transfer function is of the sigmoidal type. A typical sigmoid function has the form

    f(x) = 1 / (1 + e^(−gx))        (4.1)

When g becomes large, the sigmoid function becomes a signum function, as shown in Fig. 8.

[Figure 8 plots f(x) for g = 0.5, g = 1 and g = 5 over −10 ≤ x ≤ 9; the larger g is, the more sharply f(x) switches between 0 and 1 around x = 0.]

Backpropagation Network Function Approximation Theorem: Given any ε > 0 and any L2 function f : [0, 1]^N → R^M, there exists a three-layer (with two hidden layers) backpropagation network that can approximate f to within ε mean squared error accuracy.

The above theorem guarantees the ability of a multi-layer network with the correct weights to accurately implement an arbitrary L2 function. It does not state how the weights should be selected or even whether these weights can be found using existing network learning algorithms.

Notwithstanding the fact that backpropagation networks are not guaranteed to be able to find the correct weights for any given task, they have found numerous applications in a variety of problems. Sejnowski and Rosenberg have demonstrated that backpropagation networks can learn to convert text to realistic speech. Burr has shown that backpropagation networks can be used for recognition of spoken digits and hand-written characters with excellent performance. Backpropagation networks can also be used for data compression by forcing the output to reproduce the input for a network with a lower-dimensional hidden layer, and many more.
During training, the actual output of the network is compared with the desired output. If there is no difference, no training takes place; otherwise the weights of the network are changed to reduce the difference between actual and desired outputs.

Cost Function of the Backpropagation Network: The cost function that the backpropagation network tries to minimise is the squared difference between the actual and desired output values, summed over the output units and all pairs of input and output vectors. Let

    E(n) = (1/2) ∑_{j=1}^{M} (d_j(n) − o_j(n))²        (4.2)

be the error for input pattern n, where d_j(n) is the desired output for the jth component of the output vector for input pattern n, M is the number of output units, and o_j(n) is the jth element of the actual output vector produced by the presentation of input pattern n. Let

    E = ∑_{n=1}^{P} E(n)

be the overall measure of error, where P is the total number of the training samples. E is called the Cost Function of the backpropagation network. The backpropagation algorithm is the technique which finds the weights that minimise the cost function E. In the process of training the network, only a discrete approximation to the true gradient of E can be obtained and used.

Let

    s_hj(n) = ∑_{i=1}^{N} w_hji x_i(n)        (4.3)

be the weighted sum input to unit j in the hidden layer produced by the presentation of input pattern n, where N is the number of input units. Similarly, let

    s_oj(n) = ∑_{i=1}^{H} w_oji o_hi(n)        (4.4)

be the weighted sum input to unit j in the output layer produced by the presentation of input pattern n, where H is the number of hidden units, o_hi(n) is the output of hidden unit i produced by the presentation of input pattern n, and w_oji is the weight connecting hidden unit i and output unit j. The outputs of the hidden units and output units are, respectively,

    o_hj(n) = f(s_hj(n))        (4.5)

and

    o_j(n) = f(s_oj(n))        (4.6)

Following the presentation of each pattern to the network, the weights can be updated according to

    w_ij(n+1) = w_ij(n) − η ∂E(n)/∂w_ij        (4.8)

Alternatively, following the presentation of a complete cycle of patterns, the weights can be updated according to

    w_ij(n+1) = w_ij(n) − η ∂E/∂w_ij        (4.9)

where w_ij(n) is the value of w_ij before updating, w_ij(n+1) is the value of w_ij after updating, and η is the learning rate which determines the convergence rate and stability. The above two equations are called the on-line training mode and the batch training mode, respectively. In training the networks, either equation may be used to achieve almost the same results.

The Chain Rule for Calculating ∂E(n)/∂w_oji and ∂E(n)/∂w_hji

First we compute ∂E(n)/∂w_oji. We can write

    ∂E(n)/∂w_oji = (∂E(n)/∂s_oj(n)) (∂s_oj(n)/∂w_oji)        (4.10)

According to equation (4.4) we have

    ∂s_oj(n)/∂w_oji = ∂(∑_{i=1}^{H} w_oji o_hi(n))/∂w_oji = o_hi(n)        (4.11)

Now let us define

    δ_oj(n) = −∂E(n)/∂s_oj(n)        (4.12)

which can be expanded as

    δ_oj(n) = −∂E(n)/∂s_oj(n) = −(∂E(n)/∂o_j(n)) (∂o_j(n)/∂s_oj(n))        (4.13)

According to equation (4.6), the second factor is

    ∂o_j(n)/∂s_oj(n) = f′(s_oj(n))        (4.14)

To compute the first factor in equation (4.13), following the definition of E(n) in equation (4.2), we have

    ∂E(n)/∂o_j(n) = −(d_j(n) − o_j(n))        (4.15)

Substituting for the two factors in equation (4.13), we have

    δ_oj(n) = (d_j(n) − o_j(n)) f′(s_oj(n))        (4.16)

Combining equations (4.10), (4.11) and (4.16), we have

    ∂E(n)/∂w_oji = −o_hi(n) (d_j(n) − o_j(n)) f′(s_oj(n))        (4.17)
Now to compute ∂E(n)/∂w_hji, we can write

    ∂E(n)/∂w_hji = (∂E(n)/∂s_hj(n)) (∂s_hj(n)/∂w_hji)        (4.18)

According to equation (4.3), the second factor is ∂s_hj(n)/∂w_hji = x_i(n). Carrying the chain rule through the hidden unit output o_hj(n) = f(s_hj(n)) of equation (4.5), and through the M output units that o_hj(n) feeds, gives

    ∂E(n)/∂w_hji = −x_i(n) f′(s_hj(n)) ∑_{k=1}^{M} δ_ok(n) w_okj        (4.22)

For the batch training mode, the gradients are summed over all P training patterns:

    ∂E/∂w_oji = ∑_{n=1}^{P} ∂E(n)/∂w_oji        (4.23)

and

    ∂E/∂w_hji = ∑_{n=1}^{P} ∂E(n)/∂w_hji        (4.24)

Depending on whether the units are in the output layer or the hidden layer, the partial derivative is calculated according to equations (4.17) and (4.23), or equations (4.22) and (4.24), respectively. The backpropagation training can be summarised in the following steps, which are executed iteratively until the cost function E has decreased to an acceptable value:
BACKPROPAGATION TRAINING ALGORITHM

Backpro_proc() Begin

/////////////////////////////////////////////////////////////////////
//Define input, output, error, gradient, bias, and weight vectors  //

float w[2][64][64],e[64],eh[64],h[64],o[64],y[64],
      H[64],b[2][64],input[64],output[64];
int   i,j,k,l;
//A (training rate), h_node, seed, INPUT_VECTOR_SIZE and
//OUTPUT_VECTOR_SIZE are assumed declared elsewhere

// w[0][i][j]  ith input to jth hidden neuron
// w[1][i][j]  ith hidden to jth output neuron
// b[0][i]     Bias of ith hidden neuron
// b[1][i]     Bias of the ith output neuron
// e[i]        Error information for ith output neuron
// eh[i]       Error information for ith hidden neuron
// h[i]        Weighted sum of ith hidden neuron
// H[i]        Output of the ith hidden neuron
// o[i]        Weighted sum of the ith output neuron
// y[i]        Output of the ith output neuron
// input[i]    ith component of the input vector
// output[i]   ith component of the desired output vector
/////////////////////////////////////////////////////////////////////

//Step 1//
//Initialise weights//

srand(seed);
for(l=0;l<2;l++)
{
  for(i=0;i<64;i++)
  {
    for(j=0;j<64;j++)
    {
      w[l][i][j]=(float)(random(2400)-1200.0)/5000;
    }
  }
}
for(l=0;l<2;l++)
{
  for(i=0;i<64;i++)
  {
    b[l][i]=(float)(random(2400)-1200.0)/10000.0;
  }
}
/////////////////////////////////////////////////////////////////////

//Step 2//
//hidden layer output//

for(k=0;k<h_node;k++)                //h_node = # of hidden nodes//
{
  h[k]=0;
  for(i=0;i<INPUT_VECTOR_SIZE;i++)   //INPUT_VECTOR_SIZE = # of input nodes
  {
    h[k]+=input[i]*w[0][i][k];
  }
  h[k]+=b[0][k];
  H[k]=sigmoid(h[k]);
}

//output layer output//

for(k=0;k<OUTPUT_VECTOR_SIZE;k++)    //OUTPUT_VECTOR_SIZE = # of output nodes
{
  o[k]=0;
  for(i=0;i<h_node;i++)
  {
    o[k]+=H[i]*w[1][i][k];
  }
  o[k]+=b[1][k];
  y[k]=sigmoid(o[k]);
}
/////////////////////////////////////////////////////////////////////

//Step 3//
//error information//

//output layer//
for(i=0;i<OUTPUT_VECTOR_SIZE;i++)
{
  e[i]=grad(o[i])*(output[i]-y[i]);
}

//hidden layer//
for(i=0;i<h_node;i++)
{
  eh[i]=0;
  for(k=0;k<OUTPUT_VECTOR_SIZE;k++)
  {
    eh[i]+=e[k]*w[1][i][k];
  }
  eh[i]=grad1(h[i])*eh[i];
}
/////////////////////////////////////////////////////////////////////

//Step 4//
//update weights//

//output layer//
for(i=0;i<h_node;i++)
{
  for(j=0;j<OUTPUT_VECTOR_SIZE;j++)
  {
    w[1][i][j]=w[1][i][j]+A*H[i]*e[j];   //A = training rate, η
  }
}
for(j=0;j<OUTPUT_VECTOR_SIZE;j++)
{
  b[1][j]=b[1][j]+A*e[j];
}
//hidden layer//
for(i=0;i<INPUT_VECTOR_SIZE;i++)
{
  for(j=0;j<h_node;j++)
  {
    w[0][i][j]=w[0][i][j]+A*input[i]*eh[j];   //cf. the output layer update
  }
}
for(j=0;j<h_node;j++)
{
  b[0][j]=b[0][j]+A*eh[j];
}
/////////////////////////////////////////////////////////////////////

//First order derivative of the Sigmoid function//

1. Exclusive-Or (XOR) Task

The network consists of two input units, two hidden units, and one output unit.

2. 8-3-8 Encoder Task

The network consists of eight input units, three hidden units, and eight output units.

3. 10-5-10 Complement Encoder Task

The network consists of ten input units, five hidden units, and ten output units.
Training patterns for the 10-5-10 complement encoder task:

    input        output
    0111111111   0111111111
    1011111111   1011111111
    1101111111   1101111111
    1110111111   1110111111
    1111011111   1111011111
    1111101111   1111101111
    1111110111   1111110111
    1111111011   1111111011
    1111111101   1111111101
    1111111110   1111111110

[Plots of the cost function C against training iterations for the encoder tasks, including "Training of 10-5-10 complement encoder task".]