NNFL 3unit

The document discusses perceptrons, which are early models of artificial neurons. A perceptron is a single-layer neural network that can be trained to produce correct target outputs when presented with corresponding input vectors. Perceptrons use a learning rule called the perceptron learning rule to repeatedly study examples and learn concepts. The key components of a perceptron model are its inputs, weights, bias, net input calculation, activation function, and learning rule for updating weights. Perceptrons are well-suited for simple pattern classification problems if the patterns are linearly separable. The document then focuses on training algorithms for discrete and continuous perceptrons.

PERCEPTRONS

8.0.0 Introduction
The perceptron is one of the early models of the artificial neuron. It was proposed by
Rosenblatt in 1958. It is a single-layer neural network whose weights and biases can be trained
to produce a correct target vector when presented with the corresponding input vector. The
perceptron is a program that learns concepts, i.e. it can learn to respond with True (1) or False
(0) to the inputs presented to it, by repeatedly "studying" examples. The training
technique used is called the perceptron learning rule. The perceptron generated great interest due
to its ability to generalize from its training vectors and to work with randomly distributed
connections. Perceptrons are especially suited to simple problems in pattern classification. This
unit also gives the perceptron convergence theorem.
8.1.0 Perceptron Model
In the 1960s, perceptrons created a great deal of interest and optimism. Rosenblatt (1962) proved
a remarkable theorem about perceptron learning. Widrow (Widrow 1961, 1963, Widrow and
Angell 1962, Widrow and Hoff 1960) made a number of convincing demonstrations of
perceptron-like systems. Perceptron learning is of the supervised type. A perceptron is trained by
presenting a set of patterns to its input, one at a time, and adjusting the weights until the desired
output occurs for each of them.
The schematic diagram of the perceptron is shown in Fig. 8.1. Its synaptic weights are denoted by
w1, w2, . . ., wn. The inputs applied to the perceptron are denoted by x1, x2, . . ., xn. The
externally applied bias is denoted by b.

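[Fig. 8.1 Schematic diagram of perceptron: the inputs x1, x2, . . ., xn, weighted by w1, w2, . . ., wn,
and the bias b are summed to give net, which passes through the hard limiter f(.) to give the output o.]
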
The net input to the activation of the neuron is written as

$$\text{net} = \sum_{i=1}^{n} w_i x_i + b$$ (8.1)

The output of the perceptron is written as

$$o = f(\text{net})$$ (8.2)


where f(.) is the activation function of the perceptron. Depending upon the type of activation
function, the perceptron may be classified into two types:
i) Discrete perceptron, in which the activation function is the hard limiter or sgn(.) function.
ii) Continuous perceptron, in which the activation function is the sigmoid function, which is
differentiable.
The input-output relation may be rearranged by considering w0 = b and a fixed
bias input x0 = 1.0. Then

$$\text{net} = \sum_{i=0}^{n} w_i x_i = WX$$ (8.3)

where W = [w0, w1, w2, . . ., wn] and X = [x0, x1, x2, . . ., xn]^T.
The learning rule for perceptron has been discussed in unit 7. Specifically the learning of these
two models is discussed in the following sections.
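
As a minimal sketch of equations (8.1) to (8.3), the following Python fragment computes the net input of an augmented pattern and passes it through both kinds of activation function. The weights and inputs are hypothetical values chosen only for illustration, the sigmoid used is the bipolar form of equation (8.18), and mapping net = 0 to -1 in the hard limiter is an assumption of this sketch.

```python
import math

def net_input(weights, inputs):
    """Net input of (8.3): sum of w_i * x_i over the augmented vectors."""
    return sum(w * x for w, x in zip(weights, inputs))

def hard_limiter(net):
    """Discrete activation sgn(.); net = 0 is mapped to -1 here (an assumption)."""
    return 1 if net > 0 else -1

def sigmoid_bipolar(net):
    """Continuous activation 2 / (1 + exp(-net)) - 1, with values in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

# Hypothetical weights and inputs, chosen only for illustration.
W = [-0.5, 1.0, 1.0]   # [w0 = b, w1, w2]
X = [1.0, 1.0, 0.0]    # [x0 = 1, x1, x2]

net = net_input(W, X)                 # -0.5*1 + 1*1 + 1*0
o_discrete = hard_limiter(net)        # output of the discrete perceptron
o_continuous = sigmoid_bipolar(net)   # output of the continuous perceptron
```

Storing the bias as w0 with x0 = 1 lets the same dot product serve for both the weights and the bias, which is the point of the rearrangement in (8.3).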
8.2.0 Single Layer Discrete Perceptron Networks
For the discrete perceptron, the activation function should be the hard limiter or sgn(.) function.
A popular application of the discrete perceptron is pattern classification. To develop insight into
the behavior of a pattern classifier, it is necessary to plot a map of the decision regions in the
n-dimensional space spanned by the n input variables. The two decision regions are separated by a
hyperplane defined by

$$\sum_{i=0}^{n} w_i x_i = 0$$ (8.4)

This is illustrated in Fig. 8.2 for two input variables x1 and x2, for which the decision boundary
takes the form of a straight line.
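
For two inputs, equation (8.4) reads w0 + w1 x1 + w2 x2 = 0. The sketch below, with hypothetical weights that are not taken from the text, evaluates the left-hand side for a point and solves the equation for x2 to trace the decision line.

```python
def boundary_value(w0, w1, w2, x1, x2):
    """Left-hand side of (8.4) for two inputs; its sign tells the side of the line."""
    return w0 + w1 * x1 + w2 * x2

def boundary_x2(w0, w1, w2, x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)."""
    return -(w0 + w1 * x1) / w2

# Hypothetical weights: the decision line is x1 + x2 = 1.5.
w0, w1, w2 = -1.5, 1.0, 1.0

above = boundary_value(w0, w1, w2, 1.0, 1.0)   # positive side of the line
below = boundary_value(w0, w1, w2, 0.0, 0.0)   # negative side of the line
x2_on_line = boundary_x2(w0, w1, w2, 0.5)      # point of the line at x1 = 0.5
```

Points with a positive value fall in one decision region and points with a negative value in the other; points for which the value is exactly zero lie on the line itself.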

[Fig. 8.2 Illustration of the hyperplane (in this example, a straight line) as decision boundary for a
two-dimensional, two-class pattern classification problem. Axes: x1 and x2; regions: Class C1 and Class C2.]

For the perceptron to function properly, the two classes C1 and C2 must be linearly separable.
This, in turn, means that the patterns to be classified must be sufficiently separated from each
other to ensure that the decision surface consists of a hyperplane. This is illustrated in Fig. 8.3.
[Fig. 8.3 (a) A pair of linearly separable patterns, with the decision boundary between Class C1 and
Class C2. (b) A pair of nonlinearly separable patterns.]

In Fig. 8.3(a), the two classes C1 and C2 are sufficiently separated from each other to draw a
hyperplane (in this case a straight line) as the decision boundary. If, however, the two classes C1
and C2 are allowed to move too close to each other, as in Fig. 8.3(b), they become nonlinearly
separable, a situation that is beyond the computing capability of the perceptron.
Suppose then that the input variables of the perceptron originate from two linearly separable
classes. Let $\mathcal{X}_1$ be the subset of training vectors X1(1), X1(2), . . ., that belong to class C1 and
$\mathcal{X}_2$ be the subset of training vectors X2(1), X2(2), . . ., that belong to class C2. The union of
$\mathcal{X}_1$ and $\mathcal{X}_2$ is the complete training set $\mathcal{X}$. Given the sets of vectors $\mathcal{X}_1$ and $\mathcal{X}_2$ to train
the classifier, the training process involves the adjustment of the weight vector W in such a way that
the two classes C1 and C2 are separated. That is, there exists a weight vector W such that we may write,

$$\begin{aligned} W^T X &> 0 \quad \text{for every input vector } X \text{ belonging to class } C_1 \\ W^T X &\le 0 \quad \text{for every input vector } X \text{ belonging to class } C_2 \end{aligned}$$ (8.5)

In the second condition, it is arbitrarily chosen to say that the input vector X belongs to
class C2 if $W^T X = 0$.
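The algorithm for updating the weights may be formulated as follows:
1. If the kth member of the training set, Xk, is correctly classified by the weight vector W^k
computed at the kth iteration of the algorithm, no correction is made to the weight vector
of the perceptron, in accordance with the rule

$$W^{k+1} = W^{k} \quad \text{if } W^{kT} X_k > 0 \text{ and } X_k \text{ belongs to class } C_1$$ (8.6)

$$W^{k+1} = W^{k} \quad \text{if } W^{kT} X_k \le 0 \text{ and } X_k \text{ belongs to class } C_2$$ (8.7)

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

$$W^{(k+1)T} = W^{kT} - \eta X_k \quad \text{if } W^{kT} X_k > 0 \text{ and } X_k \text{ belongs to class } C_2$$ (8.8a)

$$W^{(k+1)T} = W^{kT} + \eta X_k \quad \text{if } W^{kT} X_k \le 0 \text{ and } X_k \text{ belongs to class } C_1$$ (8.8b)

where the learning-rate parameter η controls the adjustment applied to the weight vector.
Equations (8.8a) and (8.8b) may be written as the general expression

$$W^{(k+1)T} = W^{kT} + \eta (d - o) X_k$$ (8.9)

where d is the desired output and o is the actual output for Xk. With the outputs coded as 1 for
class C1 and 0 for class C2, (d - o) is +1 or -1 for a misclassified pattern and 0 for a correctly
classified one, so (8.9) covers (8.6) to (8.8b).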
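
A single application of the update rule (8.9) might look as follows in Python. This is a sketch under stated assumptions: outputs are coded as 1 for class C1 and 0 for class C2, net = 0 is assigned to class C2 as in (8.5), and the pattern, the starting weights and the value η = 0.5 are hypothetical.

```python
def hard_limiter(net):
    """0/1 hard limiter; net = 0 is assigned to class C2 (output 0), as in (8.5)."""
    return 1 if net > 0 else 0

def update(W, X, d, eta):
    """One application of (8.9); returns the new weights and the output o."""
    o = hard_limiter(sum(w * x for w, x in zip(W, X)))
    W_new = [w + eta * (d - o) * x for w, x in zip(W, X)]
    return W_new, o

W0 = [0.0, 0.0, 0.0]   # [w0 = b, w1, w2]
X = [1.0, 1.0, 1.0]    # augmented pattern from class C1, so d = 1

# First presentation: net = 0 gives o = 0, the pattern is misclassified
# and the weights move by +eta * X, as in (8.8b).
W1, o1 = update(W0, X, d=1, eta=0.5)

# Second presentation: net > 0 gives o = 1, so d - o = 0 and nothing changes.
W2, o2 = update(W1, X, d=1, eta=0.5)
```

Because the correction is proportional to (d - o), the same line of code leaves correctly classified patterns untouched and moves the weights toward or away from a misclassified pattern.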

8.2.1 Summary of the discrete perceptron training algorithm


Given are P training pairs of patterns
{X1, d1, X2, d2, . . ., XP, dP}, where Xi is (n×1), di is (1×1), i = 1, 2, . . ., P. Define w0 = b as the bias
and x0 = 1.0; the augmented input vector Xi then has size ((n+1)×1).
In the following, k denotes the training step and p denotes the step counter within the training
cycle.
Step 1: Choose $\eta > 0$ and define Emax.

Step 2: Initialize the weights at small random values, W = [wi], of augmented size (n+1)×1, and
initialize the counters and the error function as:
$k \leftarrow 1,\; p \leftarrow 1,\; E \leftarrow 0$.
Step 3: The training cycle begins. Apply the input and compute the output:
$X \leftarrow X_p,\; d \leftarrow d_p,\; o \leftarrow \text{sgn}(WX)$

Step 4: Update the weights: $W^T \leftarrow W^T + \eta (d - o) X$

Step 5: Compute the cycle error: $E \leftarrow \frac{1}{2}(d - o)^2 + E$ (the squared error of (8.11a) is used so that
errors of opposite sign do not cancel).

Step 6: If p < P, then $p \leftarrow p + 1$, $k \leftarrow k + 1$ and go to step 3; otherwise go to step 7.

Step 7: The training cycle is completed. If E < Emax, terminate the training session and output the
weights and k. If $E \ge E_{max}$, then $E \leftarrow 0$, $p \leftarrow 1$ and enter a new training cycle by going
to step 3.
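
The steps above can be sketched in Python as follows. This is one possible reading rather than a definitive implementation: the weights start at zero instead of small random values so that the run is reproducible, training stops when a whole cycle produces no error (E = 0), the `max_cycles` safety limit is a hypothetical addition, and the logical AND function with bipolar targets is used as an example of a linearly separable training set.

```python
def sgn(net):
    """Hard limiter with bipolar output; net = 0 is mapped to -1 (an assumption)."""
    return 1 if net > 0 else -1

def train_discrete(patterns, targets, eta=0.5, max_cycles=100):
    """Discrete perceptron training following Steps 1 to 7.

    patterns: augmented input vectors (x0 = 1); targets: desired outputs in {-1, +1}.
    Returns (W, k, cycles); cycles is None if max_cycles is reached first.
    """
    W = [0.0] * len(patterns[0])   # assumption: zeros instead of small random values
    k = 0                          # training step counter
    for cycle in range(1, max_cycles + 1):
        E = 0.0                    # cycle error, reset at the start of every cycle
        for X, d in zip(patterns, targets):                     # p = 1, ..., P
            o = sgn(sum(w * x for w, x in zip(W, X)))           # Step 3
            W = [w + eta * (d - o) * x for w, x in zip(W, X)]   # Step 4
            E += 0.5 * (d - o) ** 2                             # Step 5
            k += 1
        if E == 0.0:               # Step 7, stopping when the cycle has no errors
            return W, k, cycle
    return W, k, None

# Logical AND with bipolar targets (a linearly separable example).
patterns = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
targets = [-1, -1, -1, 1]
W, k, cycles = train_discrete(patterns, targets)
outputs = [sgn(sum(w * x for w, x in zip(W, X))) for X in patterns]
```

After training, `outputs` can be compared with `targets` to confirm that every pattern is classified correctly.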
In general, a continuous perceptron element with a sigmoidal activation function is used to
facilitate the training of multilayer feedforward networks used for classification and
recognition.
8.3.0 Single-Layer Continuous Perceptron networks
In this section, the concept of an error function in multidimensional weight space is
introduced. Also, the hard limiter (sgn(.)) is replaced by a continuous activation function, which
gives the continuous perceptron. The continuous activation function has two advantages: (i)
finer control over the training procedure and (ii) differentiability of the activation
function, which is used for computation of the error gradient.
The gradient or steepest descent method is used to update the weights. Starting from any arbitrary
weight vector W, the gradient $\nabla E(W)$ of the current error function is computed. The next value of W is
obtained by moving in the direction of the negative gradient along the multidimensional error
surface. Therefore the modification of the weight vector may be written as

$$W^{(k+1)T} = W^{kT} - \eta \nabla E(W^k)$$ (8.10)

where η is the learning constant, which is positive, and the superscript k denotes the
step number. Let us define the error function between the desired output dk and the actual output ok
as

$$E_k = \frac{1}{2}(d_k - o_k)^2$$ (8.11a)

or

$$E_k = \frac{1}{2}\left[d_k - f(W^k X)\right]^2$$ (8.11b)

where the coefficient ½ in front of the error expression is only for convenience in simplifying the
expression of the gradient, and it does not affect the location of the minimum of the error
function. The error minimization algorithm (8.10) requires computation of the gradient of
the error function (8.11), which may be written as

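$$\nabla E(W^k) = \frac{1}{2}\nabla\left[d_k - f(\text{net}_k)\right]^2$$ (8.12)

The (n+1)-dimensional gradient vector is defined as

$$\nabla E(W^k) = \begin{bmatrix} \dfrac{\partial E}{\partial w_0} \\ \dfrac{\partial E}{\partial w_1} \\ \vdots \\ \dfrac{\partial E}{\partial w_n} \end{bmatrix}$$ (8.13)

Using (8.12), we obtain the gradient vector as

$$\nabla E(W^k) = -(d_k - o_k)\, f'(\text{net}_k) \begin{bmatrix} \dfrac{\partial (\text{net}_k)}{\partial w_0} \\ \dfrac{\partial (\text{net}_k)}{\partial w_1} \\ \vdots \\ \dfrac{\partial (\text{net}_k)}{\partial w_n} \end{bmatrix}$$ (8.14)

Since $\text{net}_k = W^k X$, we have

$$\frac{\partial (\text{net}_k)}{\partial w_i} = x_i, \quad \text{for } i = 0, 1, \ldots, n$$ (8.15)

(x0 = 1 for the bias element), and equation (8.14) can be written as

$$\nabla E(W^k) = -(d_k - o_k)\, f'(\text{net}_k)\, X$$ (8.16a)

or

$$\frac{\partial E}{\partial w_i} = -(d_k - o_k)\, f'(\text{net}_k)\, x_i, \quad \text{for } i = 0, 1, \ldots, n$$ (8.16b)

$$\Delta w_i^k = -\eta \frac{\partial E}{\partial w_i} = \eta (d_k - o_k)\, f'(\text{net}_k)\, x_i$$ (8.17)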

Equation (8.17) is the training rule for the continuous perceptron. What remains is to
express f'(net) in terms of the continuous perceptron output. Consider the bipolar activation
function f(net) of the form

$$f(\text{net}) = \frac{2}{1 + \exp(-\text{net})} - 1$$ (8.18)

Differentiating equation (8.18) with respect to net gives

$$f'(\text{net}) = \frac{2\exp(-\text{net})}{\left[1 + \exp(-\text{net})\right]^2}$$ (8.19)

The following identity can be used in finding the derivative of the function:

$$\frac{2\exp(-\text{net})}{\left[1 + \exp(-\text{net})\right]^2} = \frac{1}{2}(1 - o^2)$$ (8.20)

The relation (8.20) may be verified as follows:

$$\frac{1}{2}(1 - o^2) = \frac{1}{2}\left[1 - \left(\frac{1 - \exp(-\text{net})}{1 + \exp(-\text{net})}\right)^2\right]$$ (8.21)

The right side of (8.21) can be rearranged as

$$\frac{1}{2}\left[1 - \left(\frac{1 - \exp(-\text{net})}{1 + \exp(-\text{net})}\right)^2\right] = \frac{2\exp(-\text{net})}{\left[1 + \exp(-\text{net})\right]^2}$$ (8.22)

This is the same as (8.20), and the derivative may now be written as

$$f'(\text{net}_k) = \frac{1}{2}(1 - o_k^2)$$ (8.23)

The gradient (8.16a) can then be written as

$$\nabla E(W^k) = -\frac{1}{2}(d_k - o_k)(1 - o_k^2)\, X$$ (8.24)

and the complete delta training rule for the bipolar continuous activation function results from (8.24)
as

$$W^{(k+1)T} = W^{kT} + \frac{1}{2}\eta (d_k - o_k)(1 - o_k^2)\, X_k$$ (8.25)

where k denotes the reinstated number of the training step.
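
The identity (8.23) and the gradient (8.24) can be checked numerically. The sketch below compares the derivative computed directly from (8.19) with the form ½(1 - o²), and compares the analytic gradient with a central finite-difference estimate of the error (8.11b). The weight vector, input pattern, desired output and step size h are hypothetical values chosen only for the check.

```python
import math

def f(net):
    """Bipolar sigmoid of (8.18)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def f_prime_direct(net):
    """Derivative computed from (8.19)."""
    return 2.0 * math.exp(-net) / (1.0 + math.exp(-net)) ** 2

def f_prime_from_output(net):
    """Derivative computed from the output only, as in (8.23)."""
    o = f(net)
    return 0.5 * (1.0 - o * o)

def error(W, X, d):
    """Error function of (8.11b) for one pattern."""
    o = f(sum(w * x for w, x in zip(W, X)))
    return 0.5 * (d - o) ** 2

def gradient(W, X, d):
    """Analytic gradient of (8.24)."""
    o = f(sum(w * x for w, x in zip(W, X)))
    return [-0.5 * (d - o) * (1.0 - o * o) * x for x in X]

def numerical_gradient(W, X, d, h=1e-6):
    """Central finite-difference estimate of the gradient (for checking only)."""
    grad = []
    for i in range(len(W)):
        W_plus = W[:i] + [W[i] + h] + W[i + 1:]
        W_minus = W[:i] + [W[i] - h] + W[i + 1:]
        grad.append((error(W_plus, X, d) - error(W_minus, X, d)) / (2.0 * h))
    return grad

# Hypothetical values used only for the check.
W = [0.2, -0.4, 0.7]
X = [1.0, 0.5, -1.5]
d = 1.0

identity_gap = max(abs(f_prime_direct(n) - f_prime_from_output(n))
                   for n in [-3.0, -1.0, 0.0, 0.5, 2.0])
gradient_gap = max(abs(a - b)
                   for a, b in zip(gradient(W, X, d), numerical_gradient(W, X, d)))
```

Both gaps should be close to zero, limited only by floating-point rounding and the finite-difference step.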


The weight adjustment rule (8.25) corrects the weights in the same direction as the discrete
perceptron learning rule in equation (8.8). The main difference between the two is the
presence of the moderating factor (1 - ok²). This scaling factor is always positive and never
larger than 1. Another main difference between discrete and continuous perceptron training is that the
discrete perceptron training algorithm always leads to a solution for linearly separable problems.
In contrast to this property, negative gradient-based training does not guarantee solutions for
linearly separable patterns.
8.3.1 Summary of the Single Continuous Perceptron Training Algorithm
Given are P training pairs
{X1, d1, X2, d2, . . ., XP, dP}, where Xi is ((n+1)×1), di is (1×1), for i = 1, 2, . . ., P

$$X_i = \begin{bmatrix} x_{i0} \\ x_{i1} \\ \vdots \\ x_{in} \end{bmatrix}$$, where xi0 = 1.0 (bias element)

Let k be the training step and p the step counter within the training cycle.
Step 1: $\eta > 0$ and Emax > 0 are chosen.

Step 2: The weights W are initialized at small random values; W = [wi] is (n+1)×1. The counters and
the error function are initialized:

$k \leftarrow 1,\; p \leftarrow 1,\; E \leftarrow 0$.
Step 3: The training cycle begins. The input is presented and the output is computed:
$X \leftarrow X_p,\; d \leftarrow d_p,\; o \leftarrow f(WX)$

Step 4: The weights are updated: $W^T \leftarrow W^T + \frac{1}{2}\eta (d - o)(1 - o^2) X$

Step 5: The cycle error is computed: $E \leftarrow \frac{1}{2}(d - o)^2 + E$

Step 6: If p < P, then $p \leftarrow p + 1$, $k \leftarrow k + 1$ and go to step 3; otherwise go to step 7.
Step 7: The training cycle is completed. If E < Emax, terminate the training session and output the
weights, k and E. If $E \ge E_{max}$, then $E \leftarrow 0$, $p \leftarrow 1$ and enter a new training cycle by
going to step 3.
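
A minimal Python sketch of the steps above is given here. It rests on several assumptions that are not part of the algorithm as stated: the weights start at zero instead of small random values for reproducibility, a hypothetical `max_cycles` limit guards against very long runs, and the training set is the logical AND function with bipolar targets. The values of η and Emax are illustrative only.

```python
import math

def f(net):
    """Bipolar sigmoid of (8.18)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def train_continuous(patterns, targets, eta=0.5, e_max=0.01, max_cycles=2000):
    """Delta-rule training of a single continuous perceptron (Steps 1 to 7).

    Returns (W, k, history), where history holds the cycle error E of every cycle.
    """
    W = [0.0] * len(patterns[0])   # assumption: zeros instead of small random values
    k = 0                          # training step counter
    history = []
    for _ in range(max_cycles):
        E = 0.0                    # cycle error, reset at the start of every cycle
        for X, d in zip(patterns, targets):                      # p = 1, ..., P
            o = f(sum(w * x for w, x in zip(W, X)))              # Step 3
            delta = 0.5 * eta * (d - o) * (1.0 - o * o)          # scalar part of Step 4
            W = [w + delta * x for w, x in zip(W, X)]            # Step 4
            E += 0.5 * (d - o) ** 2                              # Step 5
            k += 1
        history.append(E)
        if E < e_max:              # Step 7
            break
    return W, k, history

# Logical AND with bipolar targets (a linearly separable example).
patterns = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
targets = [-1, -1, -1, 1]
W, k, history = train_continuous(patterns, targets)
outputs = [f(sum(w * x for w, x in zip(W, X))) for X in patterns]
```

Since the sigmoid only approaches its limits of -1 and +1, the cycle error decreases gradually instead of dropping to exactly zero, which is why the stopping test uses the threshold Emax.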
8.4.0 Perceptron Convergence Theorem
This theorem states that the perceptron learning law converges to a final set of weight
values in a finite number of steps, if the classes are linearly separable. The proof of this theorem is
as follows:
Let X and W be the augmented input and weight vectors respectively. Assume that there
exists a solution W* for the classification problem. We have to show that W* can be approached in
a finite number of steps, starting from some initial weight values. We know that the solution W*
satisfies the following inequality as per equation (8.5):
$$W^{*T} X \ge \alpha > 0, \quad \text{for each } X \in C_1$$ (8.26)

where $\alpha = \min_{X \in C_1} (W^{*T} X)$.

The weight vector is updated if $W^{kT} X \le 0$ for $X \in C_1$. That is,

$$W^{k+1} = W^{k} + \eta X(k), \quad \text{for } X(k) = X \in C_1$$ (8.27)

where X(k) is used to denote the input vector at step k. If we start with W^0 = 0, where 0 is an
all-zero column vector, then

$$W^{k} = \eta \sum_{i=0}^{k-1} X(i)$$ (8.28)

Multiplying both sides of equation (8.28) by $W^{*T}$, we get

$$W^{*T} W^{k} = \eta \sum_{i=0}^{k-1} W^{*T} X(i) \ge \eta k \alpha$$ (8.29)

since $W^{*T} X(k) \ge \alpha$ according to equation (8.26). Using the Cauchy-Schwarz inequality

$$\|W^{*T}\|^2 \, \|W^{k}\|^2 \ge \left[W^{*T} W^{k}\right]^2$$ (8.30)

we get from equation (8.29)

$$\|W^{k}\|^2 \ge \frac{\eta^2 k^2 \alpha^2}{\|W^{*T}\|^2}$$ (8.31)

We also have from equation (8.27)

$$\begin{aligned} \|W^{k+1}\|^2 &= (W^{k} + \eta X(k))^T (W^{k} + \eta X(k)) \\ &= \|W^{k}\|^2 + \eta^2 \|X(k)\|^2 + 2\eta W^{kT} X(k) \\ &\le \|W^{k}\|^2 + \eta^2 \|X(k)\|^2 \end{aligned}$$ (8.32)

since for learning $W^{kT} X(k) \le 0$ when $X(k) \in C_1$. Therefore, starting from W^0 = 0, we get from
equation (8.32)

$$\|W^{k}\|^2 \le \eta^2 \sum_{i=0}^{k-1} \|X(i)\|^2 \le \eta^2 \beta k$$ (8.33)

where $\beta = \max_{X(i) \in C_1} \|X(i)\|^2$. The lower bound (8.31) grows as k² while the upper bound (8.33)
grows only as k, so both can hold only up to a largest value of k. Combining equations (8.31) and
(8.33), we obtain this value of k by solving

$$\frac{\alpha^2 k^2}{\|W^{*T}\|^2} = \beta k$$ (8.34)

or

$$k = \frac{\beta}{\alpha^2}\|W^{*T}\|^2 = \frac{\beta}{\alpha^2}\|W^{*}\|^2$$ (8.35)

Since α is positive, equation (8.35) shows that the optimum weight value can be
approached in a finite number of steps using the perceptron learning law.
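
The bound (8.35) can be compared with the number of corrections actually made on a small example. The sketch below makes several assumptions that go beyond the proof as written: the patterns of class C2 are multiplied by -1 so that every pattern must satisfy W^T X > 0 (a common device for treating both classes with the single update (8.27)), the training set is the logical AND function, the reference solution `W_star` is one hypothetical separating vector chosen by hand, and `max_passes` is only a safety limit.

```python
def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def count_corrections(patterns, eta=1.0, max_passes=1000):
    """Apply the update (8.27) until a full pass needs no correction.

    Every pattern is expected to satisfy W.X > 0 after training.
    Returns the final weights and the number of corrections made.
    """
    W = [0.0] * len(patterns[0])   # W^0 = 0, as in the proof
    corrections = 0
    for _ in range(max_passes):
        changed = False
        for X in patterns:
            if dot(W, X) <= 0:                                # pattern not yet correct
                W = [w + eta * x for w, x in zip(W, X)]       # update (8.27)
                corrections += 1
                changed = True
        if not changed:
            break
    return W, corrections

# Logical AND: class C2 patterns (1,0,0), (1,0,1), (1,1,0) negated, class C1 pattern (1,1,1).
patterns = [[-1, 0, 0], [-1, 0, -1], [-1, -1, 0], [1, 1, 1]]

W_star = [-1.5, 1.0, 1.0]                          # hypothetical known solution
alpha = min(dot(W_star, X) for X in patterns)      # smallest W*.X over the patterns
beta = max(dot(X, X) for X in patterns)            # largest squared pattern norm
bound = beta * dot(W_star, W_star) / alpha ** 2    # right-hand side of (8.35)

W, corrections = count_corrections(patterns)
```

The bound depends on the chosen `W_star`: a solution with a larger margin α relative to its norm gives a tighter limit, and the actual number of corrections is typically well below it.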
8.5.0 Problems and Limitations of the perceptron training algorithms
It may be difficult to determine whether the caveat regarding linear separability is satisfied for
the particular training set at hand. Furthermore, in many real-world situations the inputs are
often time varying and may be separable at one time and not at another. Also, there is no
statement in the proof of the perceptron learning algorithm that indicates how many steps will be
required to train the network. It is small consolation to know that training will only take a finite
number of steps if the time it takes is measured in geological units.
Furthermore, there is no proof that the perceptron training algorithm is faster than simply
trying all possible adjustments of the weights; in some cases this brute-force approach may be
superior.
8.5.1 Limitations of perceptrons
There are, however, limitations to the capabilities of perceptrons. They will learn the
solution, if there is a solution to be found. First, the output values of a perceptron can take on
only one of two values (True or False). Second, perceptrons can only classify linearly separable
sets of vectors. If a straight line or plane can be drawn to separate the input vectors into their
correct categories, the input vectors are linearly separable and the perceptron will find the
solution. If the vectors are not linearly separable, learning will never reach a point where all
vectors are classified properly. The most famous example of the perceptron's inability to solve
problems with linearly non-separable vectors is the Boolean exclusive-OR problem.
Consider the case of the exclusive-OR (XOR) problem. The XOR logic function has two
inputs and one output, as shown below.

[Fig. 8.4 (a) Exclusive-OR gate with inputs x, y and output z]

x   y   z
0   0   0
0   1   1
1   0   1
1   1   0

Fig. 8.4 (b) Truth table


It produces an output only if either one or the other of the inputs is on, but not if both are
off or both are on, as shown in the table above. We can consider this as a problem that we want
the perceptron to learn to solve: output a 1 if x is on and y is off or y is on and x is off,
otherwise output a 0. It appears to be a simple enough problem.
We can draw it in pattern space as shown in Fig. 8.5. The x-axis represents the value of
x, the y-axis represents the value of y. The shaded circles represent the inputs that produce an
output of 1, whilst the unshaded circles show the inputs that produce an output of 0.
Considering the shaded circles and unshaded circles as separate classes, we find that we cannot
draw a straight line to separate the two classes. Such patterns are known as linearly inseparable
since no straight line can divide them up successfully. Since we cannot divide them with a single
straight line, the perceptron will not be able to find any such line either, and so cannot solve such
a problem. In fact, a single-layer perceptron cannot solve any problem that is linearly
inseparable.

x   y   output
0   0   0
1   0   1
0   1   1
1   1   0

[Fig. 8.5 The XOR problem in pattern space: the points (0,1) and (1,0) produce an output of 1 and the
points (0,0) and (1,1) produce an output of 0; axes x and y.]
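
The failure on XOR can be observed by running the discrete perceptron learning rule and counting the misclassified patterns in every training cycle. The sketch below is illustrative only: it assumes 0/1 outputs, zero initial weights, η = 1 and a hypothetical limit of 50 cycles, and it uses the linearly separable AND function for contrast.

```python
def hard_limiter(net):
    """0/1 hard limiter; net = 0 gives output 0."""
    return 1 if net > 0 else 0

def errors_per_cycle(patterns, targets, eta=1.0, cycles=50):
    """Train with the rule W <- W + eta*(d - o)*X and count errors in each cycle."""
    W = [0.0] * len(patterns[0])
    counts = []
    for _ in range(cycles):
        wrong = 0
        for X, d in zip(patterns, targets):
            o = hard_limiter(sum(w * x for w, x in zip(W, X)))
            if o != d:
                wrong += 1
            W = [w + eta * (d - o) * x for w, x in zip(W, X)]
        counts.append(wrong)
    return counts

# Augmented patterns [1, x, y] for the four input combinations.
patterns = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]

xor_counts = errors_per_cycle(patterns, [0, 1, 1, 0])   # XOR: never error-free
and_counts = errors_per_cycle(patterns, [0, 0, 0, 1])   # AND: settles to zero errors
```

Every cycle on XOR contains at least one misclassification, because an error-free cycle would mean that a single fixed weight vector separates the two classes, which the pattern space of Fig. 8.5 rules out.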
