NNFL 3unit
8.0.0 Introduction
We know that the perceptron is one of the early models of the artificial neuron. It was proposed by
Rosenblatt in 1958. It is a single-layer neural network whose weights and biases can be trained
to produce the correct target vector when presented with the corresponding input vector. The
perceptron is a program that learns concepts, i.e. it can learn to respond with True (1) or False
(0) to the inputs we present to it, by repeatedly "studying" the examples presented to it. The training
technique used is called the perceptron learning rule. The perceptron generated great interest due
to its ability to generalize from its training vectors and to work with randomly distributed
connections. Perceptrons are especially suited to simple problems in pattern classification. In
this unit we also give the perceptron convergence theorem.
8.1.0 Perceptron Model
In the 1960s, perceptrons created a great deal of interest and optimism. Rosenblatt (1962) proved
a remarkable theorem about perceptron learning. Widrow (Widrow 1961, 1963; Widrow and
Angell 1962; Widrow and Hoff 1960) made a number of convincing demonstrations of
perceptron-like systems. Perceptron learning is of the supervised type. A perceptron is trained by
presenting a set of patterns to its input, one at a time, and adjusting the weights until the desired
output occurs for each of them.
The schematic diagram of the perceptron is shown in Fig. 8.1. Its synaptic weights are denoted by
w_1, w_2, . . ., w_n. The inputs applied to the perceptron are denoted by x_1, x_2, . . ., x_n. The
externally applied bias is denoted by b.
Fig. 8.1 Schematic diagram of the perceptron: inputs x_1, x_2, . . ., x_n with synaptic weights w_1, w_2, . . ., w_n and bias b feed a summing node whose net input passes through a hard-limiter activation f(.) to produce the output o
The net input to the activation of the neuron is written as

net = \sum_{i=1}^{n} w_i x_i + b    (8.1)

where W = [w_0, w_1, w_2, . . ., w_n] and X = [x_0, x_1, x_2, . . ., x_n]^T.
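Note that if we adopt the common augmentation convention of setting w_0 = b and x_0 = 1 (an assumption made explicit here so that the sum in (8.1) matches the augmented vectors W and X), the net input can be written compactly as

net = \sum_{i=0}^{n} w_i x_i = W^T X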
The learning rule for the perceptron has been discussed in Unit 7. Specifically, the learning of the
discrete and the continuous perceptron models is discussed in the following sections.
8.2.0 Single Layer Discrete Perceptron Networks
For the discrete perceptron, the activation function is the hard limiter, i.e. the sgn() function.
A popular application of the discrete perceptron is pattern classification. To develop insight into
the behavior of a pattern classifier, it is necessary to plot a map of the decision regions in the n-
dimensional space spanned by the n input variables. The two decision regions are separated by a
hyperplane defined by
\sum_{i=0}^{n} w_i x_i = 0    (8.4)
This is illustrated in Fig. 8.2 for two input variables x_1 and x_2, for which the decision boundary
takes the form of a straight line.
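As a small worked rearrangement (using the same symbols, for the case n = 2 with x_0 = 1 and w_2 \neq 0), the boundary (8.4) can be solved for x_2:

w_0 + w_1 x_1 + w_2 x_2 = 0 \;\Rightarrow\; x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}

which is the equation of a straight line with slope -w_1/w_2 and intercept -w_0/w_2.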
Fig. 8.2 Decision boundary for two input variables x_1 and x_2: a straight line separating class C1 from class C2

Fig. 8.3 (a) A pair of linearly separable classes C1 and C2; (b) a pair of nonlinearly separable classes C1 and C2
In Fig. 8.3(a), the two classes C1 and C2 are sufficiently separated from each other to draw a
hyperplane (in this case a straight line) as the decision boundary. If, however, the two classes C1
and C2 are allowed to move too close to each other, as in Fig. 8.3(b), they become nonlinearly
separable, a situation that is beyond the computing capability of the perceptron.
Suppose then that the input variables of the perceptron originate from two linearly separable
classes. Let ℋ1 be the subset of training vectors X_1(1), X_1(2), . . ., that belong to class C1 and
ℋ2 be the subset of training vectors X_2(1), X_2(2), . . ., that belong to class C2. The union of ℋ1
and ℋ2 is the complete training set ℋ. Given the sets of vectors ℋ1 and ℋ2 to train the classifier,
the training process involves the adjustment of the weight vector W in such a way that the two classes C1 and
C2 are linearly separable. That is, there exists a weight vector W such that we may write

W^T X > 0 for every input vector X belonging to class C1
W^T X \leq 0 for every input vector X belonging to class C2    (8.5)

In the second condition, it is arbitrarily chosen to say that the input vector X belongs to
class C2 if W^T X = 0.
The algorithm for updating the weights may be formulated as follows:
1. If the kth member of the training set, X_k, is correctly classified by the weight vector W_k
computed at the kth iteration of the algorithm, no correction is made to the weight vector
of the perceptron, in accordance with the rule
W_{k+1} = W_k if W_k^T X_k > 0 and X_k belongs to class C1    (8.6)
2. Otherwise, the weight vector is corrected: ηX_k is added to it when X_k belongs to class C1
and is misclassified, and ηX_k is subtracted from it when X_k belongs to class C2 and is
misclassified (equations (8.8a) and (8.8b)), where the learning-rate parameter η controls the
adjustment applied to the weight vector.
Equations (8.8a) and (8.8b) may be written as the general expression
W^{(k+1)T} = W^{kT} + \eta (d_k - o_k) X_k    (8.9)
Step 2: Initialize the weights at small random values; W = [w_i] has the augmented size (n+1)×1.
Initialize the counters and the error function as
k ← 1, p ← 1, E ← 0.
Step 3: The training cycle begins. Apply the input and compute the output:
X ← X_p, d ← d_p, o ← sgn(W^T X)
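As an illustration, the discrete perceptron training just outlined can be sketched in Python. This is only a sketch under stated assumptions (NumPy, bipolar targets d in {-1, +1}, augmented inputs with x_0 = 1, and a hypothetical learning constant eta), not the text's own pseudocode:

import numpy as np

def sgn(net):
    # Hard-limiter (sgn) activation of the discrete perceptron
    return 1.0 if net >= 0.0 else -1.0

def train_discrete_perceptron(X, d, eta=0.1, max_cycles=100):
    # X: (P, n+1) augmented inputs (x0 = 1); d: (P,) bipolar targets in {-1, +1}
    P, n_plus_1 = X.shape
    W = 0.01 * np.random.randn(n_plus_1)        # small random initial weights
    for _ in range(max_cycles):                 # repeat training cycles
        errors = 0
        for p in range(P):                      # present one pattern at a time
            o = sgn(W @ X[p])                   # o = sgn(W^T X)
            if o != d[p]:
                W = W + eta * (d[p] - o) * X[p] # error-correction update of the form (8.9)
                errors += 1
        if errors == 0:                         # stop once every pattern is classified correctly
            break
    return W

# Example: the AND function with bipolar targets (a linearly separable problem)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
W = train_discrete_perceptron(X, d)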
8.3.0 Single Layer Continuous Perceptron Networks
For the continuous perceptron, the weights are adjusted by the gradient-descent procedure

W^{(k+1)} = W^k - \eta \nabla E(W^k)    (8.10)

where η is the learning constant (a positive constant) and the superscript k denotes the
step number. Let us define the error function between the desired output d_k and the actual output o_k
as
E_k = \frac{1}{2}(d_k - o_k)^2    (8.11a)

or

E_k = \frac{1}{2}\left[d_k - f(W^{kT} X)\right]^2    (8.11b)
where the coefficient ½ in front of the error expression is only for convenience in simplifying the
expression of the gradient value, and it does not affect the location of the minimum of the error
function. The error minimization algorithm (8.10) requires computation of the gradient of
the error function (8.11), which may be written as
E(W^k) = \frac{1}{2}\left[d_k - f(net_k)\right]^2    (8.12)
The (n+1)-dimensional gradient vector is defined as

\nabla E(W^k) = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, . . ., \frac{\partial E}{\partial w_n}\right]^T    (8.13)
Using (8.12), we obtain the gradient vector as

\nabla E(W^k) = -(d_k - o_k) f'(net_k) \left[\frac{\partial(net_k)}{\partial w_0}, \frac{\partial(net_k)}{\partial w_1}, . . ., \frac{\partial(net_k)}{\partial w_n}\right]^T    (8.14)

Since net_k = W^{kT} X, the partial derivatives are

\frac{\partial(net_k)}{\partial w_i} = x_i, for i = 0, 1, . . ., n    (8.15)
so that the gradient and its components become

\nabla E(W^k) = -(d_k - o_k) f'(net_k) X    (8.16a)

or

\frac{\partial E}{\partial w_i} = -(d_k - o_k) f'(net_k) x_i, for i = 0, 1, . . ., n    (8.16b)
wi
k
w i - E(W k ) (d k - o k )f ' (net k )x i (8.17)
Equation (8.17) is the training rule for the continuous perceptron. The remaining requirement is to
express f'(net) in terms of the continuous perceptron output. Consider the bipolar activation
function f(net) of the form
f(net) = \frac{2}{1 + \exp(-net)} - 1    (8.18)

Differentiating equation (8.18) with respect to net gives

f'(net) = \frac{2\exp(-net)}{[1 + \exp(-net)]^2}    (8.19)
The following identity can be used in finding the derivative of the function:

\frac{2\exp(-net)}{[1 + \exp(-net)]^2} = \frac{1}{2}(1 - o^2)    (8.20)
Substituting o = f(net) from (8.18),

\frac{1}{2}(1 - o^2) = \frac{1}{2}\left[1 - \left(\frac{2}{1 + \exp(-net)} - 1\right)^2\right]    (8.21)

The right side of (8.21) can be rearranged as

\frac{1}{2} \cdot \frac{[1 + \exp(-net)]^2 - [1 - \exp(-net)]^2}{[1 + \exp(-net)]^2} = \frac{2\exp(-net)}{[1 + \exp(-net)]^2}    (8.22)
This is the same as the left side of (8.20), so the derivative may now be written as

f'(net_k) = \frac{1}{2}(1 - o_k^2)    (8.23)
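The identity (8.23) can also be checked numerically. The short Python sketch below is only an illustration (not part of the derivation); it compares the analytic derivative (8.19) with ½(1 - o²) at a few sample points:

import numpy as np

def f(net):
    # Bipolar continuous activation (8.18)
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def f_prime(net):
    # Analytic derivative (8.19)
    return 2.0 * np.exp(-net) / (1.0 + np.exp(-net)) ** 2

for net in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    o = f(net)
    # The last two printed values coincide, confirming f'(net) = 0.5 * (1 - o**2)
    print(net, f_prime(net), 0.5 * (1.0 - o ** 2))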
The gradient (8.16a) can then be written as

\nabla E(W^k) = -\frac{1}{2}(d_k - o_k)(1 - o_k^2) X    (8.24)

and the complete delta training rule for the bipolar continuous activation function results from (8.24)
as

W^{(k+1)T} = W^{kT} + \frac{1}{2}\eta(d_k - o_k)(1 - o_k^2) X_k    (8.25)
Each training input is the augmented vector

X_i = [x_{i0}, x_{i1}, . . ., x_{in}]^T, where x_{i0} = 1.0 (bias element).
Let k be the training step and p be the step counter within the training cycle.
Step 1: η > 0 and E_max > 0 are chosen.
Step 2: The weights are initialized at small random values; W = [w_i] is (n+1)×1. The counter and
the error function are initialized as
k ← 1, p ← 1, E ← 0.
Step 3: The training cycle begins. The input is presented and the output is computed:
X ← X_p, d ← d_p, o ← f(W^T X)
Step 4: The weights are updated: W^T ← W^T + \frac{1}{2}\eta(d - o)(1 - o^2)X
Step 5: The cycle error is computed: E ← \frac{1}{2}(d - o)^2 + E
Step 6: If p < P, then p ← p + 1, k ← k + 1, and go to Step 3; otherwise go to Step 7.
Step 7: The training cycle is completed. If E < E_max, terminate the training session and output the
weights, k, and E. If E ≥ E_max, then set E ← 0, p ← 1, and enter a new training cycle by
going to Step 3.
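The steps above translate directly into code. The following Python sketch is one possible reading of the algorithm, with assumed defaults for η, E_max, and the maximum number of cycles (these values are illustrative, not from the text):

import numpy as np

def f(net):
    # Bipolar continuous activation function (8.18)
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def train_continuous_perceptron(X, d, eta=0.5, E_max=0.01, max_cycles=1000):
    # X: (P, n+1) augmented inputs (x0 = 1); d: (P,) desired bipolar outputs
    P, n_plus_1 = X.shape
    W = 0.01 * np.random.randn(n_plus_1)              # Step 2: small random weights
    for k in range(max_cycles):
        E = 0.0                                       # reset the cycle error (Steps 2 and 7)
        for p in range(P):                            # Step 3: present each pattern
            o = f(W @ X[p])
            W = W + 0.5 * eta * (d[p] - o) * (1.0 - o ** 2) * X[p]  # Step 4: delta rule (8.25)
            E += 0.5 * (d[p] - o) ** 2                # Step 5: accumulate the cycle error
        if E < E_max:                                 # Step 7: stop when the cycle error is small
            break
    return W, E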
8.4.0 Perceptron Convergence Theorem
This theorem states that the perceptron learning law converges to a final set of weight
values in a finite number of steps, provided the classes are linearly separable. The proof of this theorem is
as follows:
Let X and W be the augmented input and weight vectors, respectively. Assume that there
exists a solution W* for the classification problem; we have to show that W* can be approached in
a finite number of steps, starting from some initial weight values. We know that the solution W*
satisfies the following inequality as per equation (8.5):

W^{*T} X > \alpha > 0, for each X \in C_1    (8.26)

where \alpha = \min_{X \in C_1} (W^{*T} X).
When an input X(k) \in C_1 is misclassified at step k, the learning law (with unit learning constant) gives

W^{k+1} = W^k + X(k), for X(k) = X \in C_1    (8.27)

where X(k) is used to denote the input vector at step k. If we start with W^0 = 0, where 0 is an
all-zero column vector, then

W^k = \sum_{i=0}^{k-1} X(i)    (8.28)
Multiplying both sides of (8.28) by W^{*T} gives

W^{*T} W^k = \sum_{i=0}^{k-1} W^{*T} X(i) > k\alpha    (8.29)

since W^{*T} X(i) > \alpha according to equation (8.26). Using the Cauchy-Schwarz inequality

\|W^*\|^2 \, \|W^k\|^2 \geq [W^{*T} W^k]^2    (8.30)

we get from equation (8.29)

\|W^k\|^2 \geq \frac{k^2 \alpha^2}{\|W^*\|^2}    (8.31)

On the other hand, taking the squared norm of both sides of (8.27),

\|W^{k+1}\|^2 = \|W^k\|^2 + 2 W^{kT} X(k) + \|X(k)\|^2 \leq \|W^k\|^2 + \|X(k)\|^2    (8.32)
since for learning W^{kT} X(k) \leq 0 when X(k) \in C_1. Therefore, starting from W^0 = 0, we get from
equation (8.32)

\|W^k\|^2 \leq \sum_{i=0}^{k-1} \|X(i)\|^2 \leq k\beta    (8.33)

where \beta = \max_{X(i) \in C_1} \|X(i)\|^2. Combining equations (8.31) and (8.33), we obtain the optimum value of k
by solving

\frac{k^2 \alpha^2}{\|W^*\|^2} = k\beta    (8.34)

or

k = \frac{\beta \|W^*\|^2}{\alpha^2} = \frac{\beta \, W^{*T} W^*}{\alpha^2}    (8.35)
Since \alpha is positive, equation (8.35) shows that the optimum weight value can be
approached in a finite number of steps using the perceptron learning law.
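As a small numerical illustration of (8.35) (with made-up values, not taken from the text): if the largest squared input norm is \beta = 4, the solution satisfies \|W^*\|^2 = 2, and the margin is \alpha = 0.5, then

k = \frac{\beta \|W^*\|^2}{\alpha^2} = \frac{4 \times 2}{(0.5)^2} = 32

so at most 32 weight corrections are needed in this case.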
8.5.0 Problems and Limitations of the Perceptron Training Algorithms
It may be difficult to determine whether the caveat regarding linear separability is satisfied for
the particular training set at hand. Furthermore, in many real-world situations the inputs are
often time-varying and may be separable at one time and not at another. Also, there is no
statement in the proof of the perceptron learning algorithm that indicates how many steps will be
required to train the network. It is small consolation to know that training will only take a finite
number of steps if the time it takes is measured in geological units.
Furthermore, there is no proof that the perceptron training algorithm is faster than simply
trying all possible adjustments of the weights; in some cases this brute-force approach may be
superior.
8.5.1 Limitations of Perceptrons
There are, however, limitations to the capabilities of perceptrons. They will learn the
solution, if there is a solution to be found. First, the output values of a perceptron can take on
only one of two values (True or False). Second, perceptrons can only classify linearly separable
sets of vectors. If a straight line or plane can be drawn to separate the input vectors into their
correct categories, the input vectors are linearly separable and the perceptron will find the
solution. If the vectors are not linearly separable, learning will never reach a point where all
vectors are classified properly. The most famous example of the perceptron's inability to solve
problems with linearly non-separable vectors is the Boolean exclusive-OR problem.
Consider the case of the exclusive-OR (XOR) problem. The XOR logic function has two
inputs and one output, as shown below.
x   y   z
0   0   0
0   1   1
1   0   1
1   1   0

(a) Exclusive-OR gate (inputs x and y, output z) with its truth table

Fig. 8.5 The XOR problem in pattern space: the input points (0,1) and (1,0) produce output 1, while (0,0) and (1,1) produce output 0
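The linear non-separability of XOR can also be verified algebraically; a brief sketch (assuming the perceptron outputs 1 when w_1 x + w_2 y + b > 0 and 0 otherwise):

% The four rows of the XOR truth table would require
b \leq 0, \qquad w_2 + b > 0, \qquad w_1 + b > 0, \qquad w_1 + w_2 + b \leq 0
% Adding the two strict inequalities gives w_1 + w_2 + 2b > 0,
% i.e. w_1 + w_2 + b > -b \geq 0, which contradicts w_1 + w_2 + b \leq 0.
% Hence no such weights and bias exist, so no single straight line separates the XOR classes.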