Perceptron Neural Networks

The net input to the Perceptron is computed as

net = Σ_{i=1}^{n} w_i x_i + b
The output of the Perceptron is written as o = f(net), where f(·) is the activation function of the Perceptron. Depending upon the type of activation function, the Perceptron may be classified into two types:

- Discrete perceptron, in which the activation function is a hard limiter or sgn(·) function.
- Continuous perceptron, in which the activation function is a sigmoid function, which is differentiable.
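To make the two activation choices concrete, here is a small illustrative Python sketch (not part of the original notes); the steepness parameter lam of the sigmoid is an assumed value.

```python
import math

def sgn(net):
    # Discrete perceptron: hard limiter, outputs +1 or -1
    return 1 if net >= 0 else -1

def bipolar_sigmoid(net, lam=1.0):
    # Continuous perceptron: differentiable bipolar sigmoid in (-1, 1);
    # lam controls the steepness (assumed parameter)
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def perceptron_output(weights, bias, x, activation=sgn):
    # net = sum_i w_i * x_i + b ;  o = f(net)
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return activation(net)
```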
Perceptrons
Linear separability
A set of (2D) patterns (x1, x2) of two classes is linearly separable if there exists a line on the (x1, x2) plane

w_0 + w_1 x_1 + w_2 x_2 = 0

that separates all patterns of one class from those of the other class.
A perceptron can be built with:
- 3 inputs x_0 = 1, x_1, x_2 with weights w_0, w_1, w_2
- For n-dimensional patterns (x_1, ..., x_n), the hyperplane w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n = 0 divides the space into two regions.
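The region test can be written directly as code. The sketch below is an illustrative Python fragment (the helper names classify and is_separated are chosen here, not taken from the notes); it labels a pattern by which side of the hyperplane it falls on and checks whether a candidate weight vector separates two sets of patterns.

```python
def classify(weights, x):
    # weights = [w0, w1, ..., wn]; x = [x1, ..., xn]
    # Returns +1 or -1 according to which side of the hyperplane
    # w0 + w1*x1 + ... + wn*xn = 0 the pattern falls on.
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if net > 0 else -1

def is_separated(weights, class1, class2):
    # True if the hyperplane puts all of class1 on the +1 side
    # and all of class2 on the -1 side.
    return (all(classify(weights, x) == 1 for x in class1) and
            all(classify(weights, x) == -1 for x in class2))

# Example: the line -1 + x1 + x2 = 0 puts (1, 1) on the positive side
print(classify([-1, 1, 1], [1, 1]))   # 1
print(classify([-1, 1, 1], [-1, 1]))  # -1
```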
Can we get the weights from a set of sample patterns?
If the problem is linearly separable, then YES (by perceptron
learning)
LINEAR SEPARABILITY
Definition: Two sets of points A and B in an n-dimensional space are called linearly separable if n+1 real numbers w_1, w_2, w_3, ..., w_{n+1} exist, such that every point (x_1, x_2, ..., x_n) ∈ A satisfies Σ_{i=1}^{n} w_i x_i ≥ w_{n+1} and every point (x_1, x_2, ..., x_n) ∈ B satisfies Σ_{i=1}^{n} w_i x_i < w_{n+1}.
Absolute Linear Separability
Two sets of points A and B in an n-dimensional space are called absolutely linearly separable if n+1 real numbers w_1, w_2, w_3, ..., w_{n+1} exist, such that every point (x_1, x_2, ..., x_n) ∈ A satisfies Σ_{i=1}^{n} w_i x_i > w_{n+1} and every point (x_1, x_2, ..., x_n) ∈ B satisfies Σ_{i=1}^{n} w_i x_i < w_{n+1}.

Two finite sets of points A and B in n-dimensional space which are linearly separable are also absolutely linearly separable.

In general, absolutely linearly separable ⇒ linearly separable; if the sets are finite, linearly separable ⇒ absolutely linearly separable.
Examples of linearly separable classes
- Logical AND function
  patterns (bipolar)          decision boundary
  x1   x2   output            w1 = 1
  -1   -1   -1                w2 = 1
  -1    1   -1                w0 = -1
   1   -1   -1
   1    1    1                -1 + x1 + x2 = 0

- Logical OR function
  patterns (bipolar)          decision boundary
  x1   x2   output            w1 = 1
  -1   -1   -1                w2 = 1
  -1    1    1                w0 = 1
   1   -1    1
   1    1    1                1 + x1 + x2 = 0
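The AND and OR weights quoted above can be checked mechanically. The short Python sketch below is illustrative (the hard limiter here returns -1 at net = 0, an assumed convention); it evaluates sgn(w0 + w1 x1 + w2 x2) on every bipolar pattern.

```python
def hard_limiter(net):
    # Bipolar hard limiter: +1 for positive net, -1 otherwise
    return 1 if net > 0 else -1

def predict(w0, w1, w2, x1, x2):
    return hard_limiter(w0 + w1 * x1 + w2 * x2)

# Bipolar AND with w0 = -1, w1 = 1, w2 = 1
for x1, x2, target in [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]:
    assert predict(-1, 1, 1, x1, x2) == target

# Bipolar OR with w0 = 1, w1 = 1, w2 = 1
for x1, x2, target in [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, 1)]:
    assert predict(1, 1, 1, x1, x2) == target

print("AND and OR decision boundaries verified")
```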
[Figure: 2D pattern plots; x: class I (output = 1), o: class II (output = -1).]
Single Layer Discrete Perceptron Networks (SLDP)
Fig. 3.2 Illustration of the hyperplane (in this example, a straight line) separating classes C1 and C2 in the (x1, x2) plane, as the decision boundary for a two-dimensional, two-class pattern classification problem.

To develop insight into the behavior of a pattern classifier, it is necessary to plot a map of the decision regions in the n-dimensional space spanned by the n input variables. The two decision regions are separated by a hyperplane defined by

Σ_{i=0}^{n} w_i x_i = 0
SLDP
Fig. 3.3 (a) A pair of linearly separable patterns (classes C1 and C2 separated by a decision boundary). (b) A pair of nonlinearly separable patterns.
For the Perceptron to function properly, the two classes C1 and C2 must be linearly
separable.
In Fig. 3.3(a), the two classes C1 and C2 are sufficiently separated from each other to draw a hyperplane (in this case a straight line) as the decision boundary.
SLDP
Assume that the input variables originate from two linearly separable classes. Let the first subset of training vectors X_1(1), X_1(2), ... belong to class C1, and the second subset of training vectors X_2(1), X_2(2), ... belong to class C2.

Given these two sets of vectors to train the classifier, the training process involves the adjustment of the weight vector W in such a way that the two classes C1 and C2 are linearly separable. That is, there exists a weight vector W such that we may write

W X > 0 for every input vector X belonging to class C1
W X ≤ 0 for every input vector X belonging to class C2
SLDP
The algorithm for updating the weights may be formulated as follows:

1. If the k-th member of the training set, X_k, is correctly classified by the weight vector W_k computed at the k-th iteration of the algorithm, no correction is made to the weight vector of the Perceptron:

   W_{k+1} = W_k   if W_k X_k > 0 and X_k belongs to class C1
   W_{k+1} = W_k   if W_k X_k ≤ 0 and X_k belongs to class C2

2. Otherwise, the weight vector of the Perceptron is updated in accordance with the rule:

   W_{k+1} = W_k − η X_k   if W_k X_k > 0 and X_k belongs to class C2
   W_{k+1} = W_k + η X_k   if W_k X_k ≤ 0 and X_k belongs to class C1

where the learning-rate parameter η controls the adjustment applied to the weight vector.
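The two cases of the update rule can also be read as code. The following is a hedged NumPy sketch (the function name train_step and the default eta = 1.0 are assumptions; the bias is folded into W as the component multiplying x0 = 1):

```python
import numpy as np

def train_step(W, X_k, is_class1, eta=1.0):
    """One step of the discrete perceptron correction rule.

    W, X_k : 1-D arrays (the bias is folded in as the x0 = 1 component).
    is_class1 : True if X_k belongs to class C1, False if it belongs to C2.
    """
    net = float(np.dot(W, X_k))
    if is_class1 and net <= 0:
        return W + eta * X_k    # misclassified C1 pattern: add eta * X_k
    if (not is_class1) and net > 0:
        return W - eta * X_k    # misclassified C2 pattern: subtract eta * X_k
    return W                    # correctly classified: no change
```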
Discrete Perceptron training algorithm
Consider P training patterns available for training the model:
{(X_1, t_1), (X_2, t_2), ..., (X_P, t_P)}, where X_i is the i-th input vector and t_i is the i-th target output, i = 1, 2, ..., P.

Learning Algorithm
Step 1: Set the learning rate η (0 < η ≤ 1).
Step 2: Initialize the weights and bias to small random values.
Step 3: Set p ← 1, where p indicates the p-th input vector presented.
Algorithm continued..
Step 4: Compute the output response

net_p = Σ_{i=1}^{n} w_i^k x_i^p + b

o_p = f(net_p)

where f(net_p) is the activation function.

For the bipolar binary activation function:

o_p = f(net_p) = +1 if net_p > θ, −1 otherwise

For the unipolar binary activation function:

o_p = f(net_p) = 1 if net_p > θ, 0 otherwise
Algorithm continued..
Step 5: Update the weights

w_i^{k+1} = w_i^k + ½ η (t_p − o_p) x_i^p

Here, the weights are updated only if the target and the output do not match.

Step 6: If p < P, then set p ← p + 1 and go to step 4 to compute the output response for the next input; otherwise go to step 7.

Step 7: Test the stopping condition: if the weights have not changed, stop and store the final weights (W) and bias (b); else go to step 3.

The network training stops when all the input vectors are correctly classified, i.e., when the target value matches the output for every input vector.
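A compact Python sketch of Steps 1-7 is given below. It is illustrative only: it uses the unipolar activation, drops the ½ factor and also adjusts the bias by eta*(t - o), in line with the worked OR-gate example that follows; eta, theta and the zero initialization are assumed values.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, theta=0.2, max_epochs=100):
    """Discrete perceptron training (unipolar activation), following Steps 1-7.

    X : (P, n) array of input vectors, t : (P,) array of 0/1 targets.
    eta, theta, max_epochs are assumed example values.
    """
    P, n = X.shape
    w = np.zeros(n)                              # Step 2 (zeros for simplicity)
    b = 0.0
    for _ in range(max_epochs):
        changed = False
        for p in range(P):                       # Step 3: present patterns one by one
            net = np.dot(w, X[p]) + b            # Step 4: net input
            o = 1 if net > theta else 0          # unipolar binary activation
            if o != t[p]:                        # Step 5: update only on a mismatch
                w = w + eta * (t[p] - o) * X[p]
                b = b + eta * (t[p] - o)
                changed = True
        if not changed:                          # Step 7: stop when no weight changed
            break
    return w, b

# OR gate example
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
print(train_perceptron(X, t))
```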
Example:
Build the Perceptron network to realize fundamental logic gates, such as AND, OR
and XOR.
Solution:
The following steps are included for hand calculations with OR gate input-output data.
Table: OR logic gate function

  Input          Output (Target)
  X1    X2
  0     0        0
  0     1        1
  1     0        1
  1     1        1

Step 1: Initialize the weights w1 = 0.1, w2 = 0.3 (bias b = 0).
Step 2: Set the learning rate η = 0.1 and the threshold value θ = 0.2.
Step 3: Apply the input patterns one by one and repeat steps 4 and 5.
For input 1:
Let us consider the input X^1 = [0, 0] with target t_1 = 0.

Step 4: Compute the net input to the Perceptron:

net_1 = Σ_{i=1}^{2} w_i^0 x_i^1 + b = 0.1(0) + 0.3(0) + 0 = 0

With the unipolar binary activation function (θ = 0.2), the output is obtained as

o_1 = f(0) = 0

Step 5: The output is the same as the target, t_1 = 0; that is, the input pattern is correctly classified. Therefore, the weights and bias retain their previous values; no weight update takes place.

The weight vector for the next input is w^1 = [0.1 0.3].
For input 2:
Steps 4 and 5 are repeated for the next input, X^2 = [0, 1], with target t_2 = 1.

The net input is obtained as

net_2 = Σ_{i=1}^{2} w_i^1 x_i^2 + b = 0.1(0) + 0.3(1) + 0 = 0.3

The corresponding output is o_2 = f(0.3) = 1.

The output is the same as the target, t_2 = 1; that is, the input pattern is correctly classified. Therefore, the weights and bias retain their previous values; no update takes place. The weight vector for the next input remains [0.1 0.3].
For input 3:
Repeat steps 4 and 5 for the next input, X^3 = [1, 0], with target t_3 = 1.

Compute the net input to the Perceptron and the output:

net_3 = Σ_{i=1}^{2} w_i^2 x_i^3 + b = 0.1(1) + 0.3(0) + 0 = 0.1

o_3 = f(0.1) = 0

The output is not the same as the target, t_3 = 1, so the weights are updated using equation (3.14):

w_1^3 = w_1^2 + η (t_3 − o_3) x_1^3 = 0.1 + 0.1(1 − 0)(1) = 0.2
w_2^3 = w_2^2 + η (t_3 − o_3) x_2^3 = 0.3 + 0.1(1 − 0)(0) = 0.3

So the weights become [0.2 0.3].
For input 4:
Repeat steps 4 and 5 for the next input, X^4 = [1, 1], with target t_4 = 1.

Compute the net input to the Perceptron and the output:

net_4 = Σ_{i=1}^{2} w_i^3 x_i^4 + b = 0.2(1) + 0.3(1) + 0 = 0.5

The corresponding output, using equation (3.13), is obtained as o_4 = f(0.5) = 1.

The output is the same as the target, t_4 = 1; that is, the input pattern is correctly classified. Therefore, the weights and bias retain their previous values; no update takes place. The weight matrix after completion of one cycle is [0.2 0.3].
The summary of the weight changes is given in Table 3.2.

Table 3.2: The updated weights

  Input (X1 X2)   Net    Output   Target   Updated w1   Updated w2
  (initial)        -      -        -        0.1          0.3
  0  0             0      0        0        0.1          0.3
  0  1             0.3    1        1        0.1          0.3
  1  0             0.1    0        1        0.2          0.3
  1  1             0.5    1        1        0.2          0.3
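The same cycle can be reproduced in a few lines of Python. The sketch below is illustrative (it uses the values w1 = 0.1, w2 = 0.3, b = 0, eta = 0.1, theta = 0.2 from the example); it prints the net, output and weights for each of the four OR patterns and ends with the weights [0.2, 0.3] of Table 3.2.

```python
# Reproduce the first training cycle of the OR-gate hand calculation
w = [0.1, 0.3]
b = 0.0
eta, theta = 0.1, 0.2
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

for x, t in patterns:
    net = w[0] * x[0] + w[1] * x[1] + b          # Step 4: net input
    o = 1 if net > theta else 0                  # unipolar activation
    if o != t:
        # Step 5: update only the weights, as in the hand calculation
        w = [w[i] + eta * (t - o) * x[i] for i in range(2)]
    print(f"x={x}  net={net:.1f}  o={o}  t={t}  w={w}")
# Final weights after one cycle: [0.2, 0.3], matching Table 3.2
```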
[Fig. 3.4: Error profile during the training of the Perceptron to learn the input-output relation of the OR gate.]
[Fig. 3.5: Error profile during the training of the Perceptron to learn the input-output relation of the AND gate.]
Results
[Fig. 3.6: Error profile during the training of the Perceptron to learn the input-output relation of the XOR gate.]
Single-Layer Continuous Perceptron networks
(SLCP)
The activation function that is used in modeling the Continuous
Perceptron is sigmoidal, which is differentiable.
The two advantages of using a continuous activation function are (i) finer control over the training procedure and (ii) the differentiability of the activation function, which allows computation of the error gradient.

This gives scope to use gradients in modifying the weights. The gradient or steepest-descent method is used in updating the weights: starting from an arbitrary weight vector W, the gradient ∇E(W) of the current error function is computed.
Single-Layer Continuous Perceptron networks
The updated weight vector may be written as

W^{k+1} = W^k − η ∇E(W^k)   (3.22)

where η is the learning constant.

The error function at step k may be written as

E_k = ½ (t_k − o_k)²   (3.23a)

or

E_k = ½ [t_k − f(W^k X_k)]²   (3.23b)
SLCP
The error-minimization algorithm (3.22) requires computation of the gradient of the error function (3.23), which may be written as

∇E(W^k) = ∇{ ½ [t_k − f(net_k)]² }   (3.24)

The (n+1)-dimensional gradient vector is defined as

∇E(W^k) = [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]^T   (3.25)
SLCP
Using (3.24), we obtain the gradient vector as

∇E(W^k) = −(t_k − o_k) f'(net_k) [ ∂net_k/∂w_0, ∂net_k/∂w_1, ..., ∂net_k/∂w_n ]^T   (3.26)

Since net_k = W^k X_k, we have

∂net_k/∂w_i = x_i,   for i = 0, 1, ..., n   (3.27)

(x_0 = 1 for the bias element).
SLCP
Using (3.27), the gradient (3.26) can be written as

∇E(W^k) = −(t_k − o_k) f'(net_k) X_k   (3.28a)

or

∂E/∂w_i = −(t_k − o_k) f'(net_k) x_i,   for i = 0, 1, ..., n   (3.28b)

so that the weight adjustment is

Δw_i^k = −η ∂E/∂w_i = η (t_k − o_k) f'(net_k) x_i   (3.29)
SLCP
The gradient (3.28a) can be written as

∇E(W^k) = −½ (t_k − o_k)(1 − o_k²) X_k   (3.32)

using f'(net_k) = ½ (1 − o_k²) for the bipolar continuous (sigmoidal) activation function, and the complete delta training rule results from (3.32) as

W^{k+1} = W^k + ½ η (t_k − o_k)(1 − o_k²) X_k   (3.33)

where k denotes the number of the training step.
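The delta rule (3.33) is easy to state as code. The sketch below is an illustrative NumPy fragment (the function names and the training loop over the bipolar OR patterns are assumptions, not part of the notes); it uses the bipolar sigmoid f(net) = 2/(1 + exp(-net)) - 1, whose derivative is f'(net) = ½ (1 - o²).

```python
import numpy as np

def bipolar_sigmoid(net):
    # f(net) = 2 / (1 + exp(-net)) - 1, with f'(net) = 0.5 * (1 - f(net)**2)
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def delta_rule_step(W, x, t, eta=0.1):
    """One continuous-perceptron update, equation (3.33):
    W <- W + 0.5 * eta * (t - o) * (1 - o**2) * x
    x includes the bias component x0 = 1; eta is an assumed value."""
    o = bipolar_sigmoid(np.dot(W, x))
    return W + 0.5 * eta * (t - o) * (1.0 - o ** 2) * x, o

# Example: a few sweeps over the bipolar OR patterns (x0 = 1 prepended)
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
T = np.array([-1, 1, 1, 1], dtype=float)
W = np.zeros(3)
for _ in range(200):
    for x, t in zip(X, T):
        W, _ = delta_rule_step(W, x, t, eta=0.5)
print(W)  # trained weights; their signs realize an OR-like boundary
```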
Perceptron Convergence Theorem
This theorem states that the Perceptron learning law converges to a final set of weight values in a finite number of steps, if the classes are linearly separable.

Proposition: If the sets P and N are finite and linearly separable, the Perceptron learning algorithm updates the weight vector w_t only a finite number of times.

In the proof, the vectors of N are negated and combined with P into the set P' = P ∪ N⁻, all input vectors are normalized, and w* denotes a normalized solution vector (which exists because the classes are linearly separable).
Perceptron Convergence Theorem
Now, assume that at step t+1 the weight vector is updated because some vector p_i ∈ P' was misclassified, so that w_{t+1} = w_t + p_i. The cosine of the angle between w_{t+1} and the solution vector w* is

cos α = (w* · w_{t+1}) / ‖w_{t+1}‖   (3.38)

Numerator of equation (3.38):

w* · w_{t+1} = w* · (w_t + p_i) = w* · w_t + w* · p_i ≥ w* · w_t + δ

where δ = min{ w* · p : p ∈ P' }.

By induction, after t+1 corrections starting from w_0:

w* · w_{t+1} ≥ w* · w_0 + (t + 1) δ   (3.39)
Perceptron Convergence Theorem
For the denominator, since w_{t+1} = w_t + p_i and the correction was made using p_i (which satisfies w_t · p_i ≤ 0), we have

‖w_{t+1}‖² = ‖w_t + p_i‖² = ‖w_t‖² + 2 w_t · p_i + ‖p_i‖² ≤ ‖w_t‖² + ‖p_i‖² ≤ ‖w_t‖² + 1

(since p_i is normalized).

By induction:

‖w_{t+1}‖² ≤ ‖w_0‖² + (t + 1)   (3.40)
Substituting (3.39) and (3.40) into (3.38), we get

cos α ≥ ( w* · w_0 + (t + 1) δ ) / √( ‖w_0‖² + (t + 1) )

For w_0 = 0 this reduces to

cos α ≥ (t + 1) δ / √(t + 1) = √(t + 1) δ

The right-hand side grows proportionally to √t and, since δ > 0, it can become arbitrarily large. However, since cos α ≤ 1, t must be bounded by a maximum value t ≤ (1/δ)².

Therefore, the number of corrections to the weight vector must be finite.
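As a numerical illustration (my own, not from the notes), the sketch below computes δ for the bipolar AND example, using the normalized solution vector w* ∝ [-1, 1, 1], after negating the class-II patterns and normalizing all of them, and then evaluates the bound (1/δ)² on the number of corrections.

```python
import numpy as np

# Bipolar AND patterns with bias component x0 = 1, labelled +1 or -1
patterns = [([1, -1, -1], -1), ([1, -1, 1], -1), ([1, 1, -1], -1), ([1, 1, 1], 1)]

# Build P' = P ∪ N⁻ : negate the class -1 patterns, then normalize everything
P_prime = [np.array(x, dtype=float) * (1 if label == 1 else -1) for x, label in patterns]
P_prime = [p / np.linalg.norm(p) for p in P_prime]

w_star = np.array([-1.0, 1.0, 1.0])
w_star = w_star / np.linalg.norm(w_star)      # normalized solution vector

delta = min(float(np.dot(w_star, p)) for p in P_prime)
print(delta, (1.0 / delta) ** 2)              # delta = 1/3, so at most ~9 corrections
```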
Limitations of Perceptron
There are limitations to the capabilities of the Perceptron, however. It will learn a solution only if one exists.

First, the output values of a Perceptron can take on only one of two values (True or False).

Second, a Perceptron can only classify linearly separable sets of vectors. If a straight line or plane can be drawn to separate the input vectors into their correct categories, the input vectors are linearly separable and the Perceptron will find the solution. If the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly.

The most famous example of the Perceptron's inability to solve problems with linearly non-separable vectors is the Boolean XOR realization.
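This limitation is easy to demonstrate empirically. The hedged Python sketch below (reusing the discrete training loop from earlier; max_epochs and the other values are assumptions) converges quickly for the OR targets but never finds a consistent weight set for the XOR targets.

```python
import numpy as np

def train(X, t, eta=0.1, theta=0.2, max_epochs=1000):
    # Discrete (unipolar) perceptron training; returns the number of epochs
    # used and whether every pattern ended up correctly classified.
    w, b = np.zeros(X.shape[1]), 0.0
    for epoch in range(1, max_epochs + 1):
        changed = False
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) + b > theta else 0
            if o != target:
                w, b, changed = w + eta * (target - o) * x, b + eta * (target - o), True
        if not changed:
            return epoch, True
    return max_epochs, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train(X, np.array([0, 1, 1, 1])))  # OR: converges in a few epochs
print(train(X, np.array([0, 1, 1, 0])))  # XOR: never converges (returns False)
```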