
Introduction to Artificial Neural Networks

Lecture 4:

Perceptron and ADALINE

By: Hussein Kanaan


Outline
• Introduction
• Perceptron
• Selection of weights for the Perceptron
• Perceptron Learning Theorem
• Implementation of Logical Gates
• Finding Weights by MSE Method: off-line
• Perceptron learning law: the geometric interpretation
• Convergence of the Perceptron learning law
• Limitation of Perceptron
• Representation of Perceptron in MATLAB
• ADALINE - The Adaptive Linear Element
• Applications of Adaline
• Error concept
• Method of steepest descent
• The LMS (Widrow-Hoff) Learning Law
• Network training
• Some general comments on the learning process
• The effect of the learning rate

Lecture 4-2
Introduction

• The LMS algorithm is built around a linear neuron (a neuron with a linear activation function).
• The Perceptron, proposed by Rosenblatt (1958), is instead built around a nonlinear neuron, namely the McCulloch-Pitts model of a neuron.
• This neuron has a hard-limiting activation function (it performs the signum function).
• Recently the term multilayer Perceptron has often been used as a synonym for the term multilayer feedforward neural network; here, however, Perceptron refers to the original single-layer model.

Lecture 4-3
Perceptron (1)
• Goal
  • classify the applied inputs x1, x2, ..., xm into one of two classes
• Procedure
  • if the output of the hard limiter is +1, assign the input to class C1; if it is -1, assign it to class C2
  • the input of the hard limiter is the weighted sum of the inputs:
    $v = \sum_{i=1}^{m} w_i x_i + \theta$
  • the effect of the bias θ is merely to shift the decision boundary away from the origin
  • the synaptic weights are adapted on an iteration-by-iteration basis
[Figure: single neuron with inputs x1, ..., xm, weights w1, ..., wm, bias θ, hard limiter φ(·), induced field vk and output yk]
Lecture 4-4
Perceptron (2)
• Decision regions are separated by a hyperplane:
  $\sum_{i=1}^{m} w_i x_i + \theta = 0$
• a point (x1, x2) above the boundary line is assigned to class C1
• a point (y1, y2) below the boundary line is assigned to class C2
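As a minimal MATLAB sketch of this decision rule (not taken from the lecture's M_files; the weight, bias, and input values are illustrative assumptions):

% Hard-limiter decision of a single perceptron
w     = [1; 1];            % synaptic weights (illustrative)
theta = -1.5;              % bias: shifts the boundary away from the origin
x     = [1; 0];            % input pattern to classify
v     = w' * x + theta;    % weighted sum (induced local field)
if v > 0
    y = +1;                % assign to class C1
else
    y = -1;                % assign to class C2
end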

Lecture 4-5
Selection of weights for the
Perceptron
• In general, two basic methods can be employed to select a suitable weight vector:
  • By off-line calculation of weights.
    • If the problem is relatively simple, it is often possible to calculate the weight vector from the specification of the problem.
  • By a learning procedure.
    • The weight vector is determined from a given (training) set of input-output vectors (exemplars) in such a way as to achieve the best classification of the training vectors.

Lecture 4-6
Perceptron Learning Theorem (1)
• Linearly separable classes
  • If two classes are linearly separable, there exists a decision surface consisting of a hyperplane.
  • In that case there exists a weight vector w such that
    $w^T x > 0$ for every input vector x belonging to class C1
    $w^T x \le 0$ for every input vector x belonging to class C2
  • The perceptron works well only for linearly separable classes.

Lecture 4-7
Perceptron Learning Theorem (2)
• Using the modified signal-flow graph:
  • the bias θ(n) is treated as a synaptic weight driven by a fixed input x0 = +1, i.e. $w_0(n) = \theta(n)$
  • the linear combiner output is
    $v(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$
[Figure: the same neuron with the additional input x0 = +1 and weight w0 = θk, hard limiter φ(v), induced field vk and output yk]

Lecture 4-8
Perceptron Learning Theorem (3)
• Weight adjustment (a single-step MATLAB sketch follows below)
  • if x(n) is correctly classified, the weight vector is left unchanged:
    $w(n+1) = w(n)$ if $w^T x(n) > 0$ and $x(n)$ belongs to class C1
    $w(n+1) = w(n)$ if $w^T x(n) \le 0$ and $x(n)$ belongs to class C2
  • otherwise:
    $w(n+1) = w(n) - \eta(n)\, x(n)$ if $w^T x(n) > 0$ and $x(n)$ belongs to class C2
    $w(n+1) = w(n) + \eta(n)\, x(n)$ if $w^T x(n) \le 0$ and $x(n)$ belongs to class C1
  • the learning-rate parameter η(n) controls the adjustment applied to the weight vector
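A hedged MATLAB sketch of one adaptation step, assuming w and x (both including the bias component x0 = +1) and the target class label are already defined; the variable names and the value of eta are illustrative:

% One step of the Perceptron learning law in its conditional form
eta = 0.5;                        % learning-rate parameter eta(n)
v   = w' * x;                     % induced local field
if     v >  0 && target == -1     % x wrongly placed on the C1 side
    w = w - eta * x;
elseif v <= 0 && target == +1     % x wrongly placed on the C2 side
    w = w + eta * x;
end                               % correctly classified: w unchanged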

Lecture 4-9
Summary of Learning
1. Initialization
   set w(0) = 0
2. Activation
   at time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n)
3. Computation of actual response
   $y(n) = \mathrm{sgn}[\,w^T(n)\, x(n)\,]$
4. Adaptation of the weight vector
   $w(n+1) = w(n) + \eta\,[\,d(n) - y(n)\,]\, x(n)$
   where $e(n) = d(n) - y(n)$ is the error and
   $d(n) = +1$ if x(n) belongs to class C1, $d(n) = -1$ if x(n) belongs to class C2
5. Continuation
   increment time step n and go back to step 2
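A minimal MATLAB sketch of the whole procedure (not the lecture's M_file; the AND-gate data, learning rate, and epoch limit are illustrative assumptions):

% Perceptron training loop following steps 1-5 above
X   = [1 0 0;  1 0 1;  1 1 0;  1 1 1];   % patterns with a leading +1 bias input
d   = [-1; -1; -1; 1];                   % desired responses (+1 = C1, -1 = C2)
eta = 0.5;                               % learning-rate parameter
w   = zeros(size(X,2), 1);               % step 1: w(0) = 0
for epoch = 1:100                        % repeat until no errors (or a limit)
    errors = 0;
    for n = 1:size(X,1)                  % steps 2-4 for each pattern
        x = X(n,:)';
        y = sign(w' * x);                % actual response sgn(w'x)
        if y == 0, y = -1; end           % treat sgn(0) as class C2
        w = w + eta * (d(n) - y) * x;    % step 4: weight adaptation
        errors = errors + (y ~= d(n));
    end
    if errors == 0, break; end           % all patterns correctly classified
end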
Lecture 4-10
The network is capable of solving linearly separable problems:

$\sum_{i=1}^{m} w_i x_i + \theta > 0 \;\Rightarrow\;$ the input is assigned to class C1
$\sum_{i=1}^{m} w_i x_i + \theta = 0 \;\Rightarrow\;$ the input lies on the decision boundary
$\sum_{i=1}^{m} w_i x_i + \theta < 0 \;\Rightarrow\;$ the input is assigned to class C2

Lecture 4-11
Learning rule
An algorithm to update the weights w so that, finally, the input patterns of the two classes lie on opposite sides of the line decided by the perceptron.

Let t be the time; at t = 0 we have w(0).
[Figure: the weight vector w(0), the current input x, and the decision line at t = 0]


Lecture 4-12
Learning rule
An algorithm to update the weights w so that, finally, the input patterns of the two classes lie on opposite sides of the line decided by the perceptron.

Let t be the time; at t = 1 we have w(1).
[Figure: the weight vector w(1), the current input x, and the decision line at t = 1]


Lecture 4-13
Learning rule
An algorithm to update the weights w so that, finally, the input patterns of the two classes lie on opposite sides of the line decided by the perceptron.

Let t be the time; at t = 2 we have w(2).
[Figure: the weight vector w(2), the current input x, and the decision line at t = 2]



Lecture 4-14
Learning rule
An algorithm to update the weights w so that, finally, the input patterns of the two classes lie on opposite sides of the line decided by the perceptron.

Let t be the time; at t = 3 we have w(3).
[Figure: the weight vector w(3), the current input x, and the decision line at t = 3]



Lecture 4-15
Implementation of Logical NOT, AND,
and OR

Lecture 4-16
Implementation of Logical Gate

Lecture 4-17
Finding Weights Analytically for the
AND Network: off-line
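The worked example on this slide is not reproduced; as a hedged illustration of the off-line approach, the AND-gate weights can be chosen by inspection and verified in MATLAB (the values are an assumption, consistent with the MSE result on the next slide):

% Hand-picked AND-gate weights checked against all input patterns
w = [1; 1];  theta = -1.5;            % chosen so that v > 0 only for (1,1)
X = [0 0; 0 1; 1 0; 1 1];             % all AND-gate input patterns
y = sign(X * w + theta)               % gives -1 -1 -1 +1, as required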

Lecture 4-18
Finding Weights by MSE Method:
off-line
• Write an equation for each training vector
• The output for the first class is +1 and for the second class is -1 (or 0)
• Apply the MSE method to solve the resulting system
• Example: implementation of the AND gate, with decision boundary $\sum_{i=1}^{m} w_i x_i + \theta = 0$:

$\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \theta \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ +1 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ w_2 \\ \theta \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ -1.5 \end{bmatrix}$

Lecture 4-19
Summary: Perceptron vs. MSE
procedures

Lecture 4-20
Perceptron learning law: the
geometric interpretation

• w(n): current weight vector
• w(n+1): next weight vector
• w*: correct (desired) weight vector
Lecture 4-21
Perceptron learning law: the
geometric interpretation (cont.)
• During the learning process the current weight vector w(n) is modified in the direction of the current input vector x(n) if the input pattern is misclassified, that is, if the error is non-zero.
• Presenting the Perceptron with enough training vectors, the weight vector w(n) will tend to the correct value w*.
• Rosenblatt proved that if the input patterns are linearly separable, then the Perceptron learning law converges, and the hyperplane separating the two classes of input patterns can be determined.

Lecture 4-22
Convergence of the Perceptron
learning law (1)
• Fixed-increment convergence theorem
  • For linearly separable classes X1 and X2, the perceptron converges after some n0 iterations, in the sense that
    $w(n_0) = w(n_0 + 1) = w(n_0 + 2) = \ldots$
    is a solution vector for $n_0 \le n_{\max}$
  • the proof below is for the case $\eta(n) = 1$

Lecture 4-23
Convergence of the Perceptron
learning law (2)
• Assume:
  • the correct weight vector is normalized, $|W^*| = 1$, and $|X| \le 1$ for every $X \in C_1$
  • there is a small positive fixed number δ such that $W^* \cdot X \ge \delta$ for every $X \in C_1$
• Define
  $G(W) = \dfrac{W^* \cdot W}{|W|} \le 1$
  • G(W) is the cosine of the angle between W and W*
• Consider the behavior of G(W) through adaptation (step 4, slide 10):
  $W^* \cdot W(n+1) = W^* \cdot (W(n) + X) = W^* \cdot W(n) + W^* \cdot X \ge W^* \cdot W(n) + \delta$

Lecture 4-24
Convergence of the Perceptron
learning law (3)
• After the n-th application of the adaptation step:
  $W^* \cdot W(n) \ge n\,\delta$
• The denominator of G(W) grows more slowly:
  $|W(n+1)|^2 = W(n+1) \cdot W(n+1) = (W(n) + X) \cdot (W(n) + X) = |W(n)|^2 + 2\,W(n) \cdot X + |X|^2 \le |W(n)|^2 + 1$
  (an update is made only when $W(n) \cdot X \le 0$, and $|X|^2 \le 1$), so that
  $|W(n)|^2 \le n$
• Therefore
  $G(W(n)) = \dfrac{W^* \cdot W(n)}{|W(n)|} \ge \dfrac{n\,\delta}{\sqrt{n}}, \qquad G(W) \le 1$

Lecture 4-25
Convergence of the Perceptron
learning law (4)
• The number of times, n, that we go to the "adaptation step" will therefore be finite and bounded by $\dfrac{1}{\delta^2}$, as shown below.
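Combining the two bounds from the previous slide (a short restatement of the argument):

$1 \;\ge\; G(W(n)) \;=\; \frac{W^* \cdot W(n)}{|W(n)|} \;\ge\; \frac{n\,\delta}{\sqrt{n}} \;=\; \sqrt{n}\,\delta \;\;\Longrightarrow\;\; n \;\le\; \frac{1}{\delta^2}.$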

Lecture 4-26
Limitation of Perceptron
• The XOR problem (Minsky): the two classes are not linearly separable

Lecture 4-27
Perceptron with sigmoid activation
function
• For a single neuron with a step activation function:
• For a single neuron with a sigmoid activation function:
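The formulas shown on the original slide are not reproduced here; the standard forms (an assumption about what the slide contained) are, with $v = w^T x$:

$y = \mathrm{sgn}(v) = \begin{cases} +1, & v \ge 0 \\ -1, & v < 0 \end{cases}$ for the step (signum) activation, and

$y = \varphi(v) = \dfrac{1}{1 + e^{-a v}}$ for the sigmoid (logistic) activation.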

Lecture 4-28
Representation of Perceptron in
MATLAB

Lecture 4-29
MATLAB TOOLBOX
• net = newp(pr,s,tf,lf)
• Description of the function
  • Perceptrons are used to solve simple (i.e. linearly separable) classification problems.
  • NET = NEWP(PR,S,TF,LF) takes these inputs:
    • PR - Rx2 matrix of min and max values for R input elements.
    • S - Number of neurons.
    • TF - Transfer function, default = 'hardlim'.
    • LF - Learning function, default = 'learnp'.
  and returns a new perceptron.
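A hedged usage sketch, assuming the (legacy) Neural Network Toolbox API described above together with its train and sim functions; the AND-gate data are illustrative:

% Create and train a perceptron for the AND problem
P   = [0 0 1 1;  0 1 0 1];        % two inputs, four patterns (columns)
T   = [0 0 0 1];                  % hardlim targets (0/1)
net = newp([0 1; 0 1], 1);        % PR = input ranges, S = one neuron
net = train(net, P, T);           % weights adjusted with 'learnp'
Y   = sim(net, P);                % classify the training patterns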

Lecture 4-30
Classification example: linear separability
• See the M_file.

Lecture 4-31
Scatter plot of the data and the learning curve

Lecture 4-32
Scatter plot of the data after training

Decision boundary

Lecture 4-33
Classification of data: nonlinear separability

Lecture 4-34
Classification of data: nonlinear separability

Lecture 4-35
ADALINE - The Adaptive Linear Element
• ADALINE is a Perceptron with a linear activation function
• It was proposed by Widrow
  $y = \sum_i w_i x_i = X^T W$
Lecture 4-36
Applications of Adaline

• In general, the Adaline is used to perform:
  • Linear approximation of a "small" segment of a nonlinear hyper-surface, which is generated by a p-variable function y = f(x). In this case the bias is usually needed.
  • Linear filtering and prediction of data (signals).
  • Pattern association, that is, generation of m-element output vectors associated with respective p-element input vectors.

Lecture 4-37
Error concept
• For a single neuron:
  $\varepsilon = d - y$
• For multiple neurons (m is the number of output neurons):
  $\varepsilon_i = d_i - y_i, \quad i = 1, \ldots, m \qquad \text{i.e.} \quad \varepsilon_{m \times 1} = d_{m \times 1} - y_{m \times 1}$
• The total measure of the goodness of approximation, or the performance index, can be specified by the mean-squared error over m neurons and N training vectors:
  $J(W) = \frac{1}{2mN} \sum_{i=1}^{N} \sum_{j=1}^{m} e_j^2(i)$

Lecture 4-38
• Input data:
  $X_{N \times p} = \begin{bmatrix} x_{11} & \ldots & x_{1p} \\ x_{21} & \ldots & x_{2p} \\ \vdots & & \vdots \\ x_{N1} & \ldots & x_{Np} \end{bmatrix}$
• Desired output (target):
  $D_{N \times m} = \begin{bmatrix} d_{11} & \ldots & d_{1m} \\ d_{21} & \ldots & d_{2m} \\ \vdots & & \vdots \\ d_{N1} & \ldots & d_{Nm} \end{bmatrix}$
• Weights:
  $W_{p \times m} = \begin{bmatrix} w_{11} & \ldots & w_{1m} \\ w_{21} & \ldots & w_{2m} \\ \vdots & & \vdots \\ w_{p1} & \ldots & w_{pm} \end{bmatrix}$
• Equation:
  $X_{N \times p}\, W_{p \times m} = Y_{N \times m}$

Lecture 4-39
• The MSE solution is:
  $W_{p \times m} = \left( X^T_{p \times N}\, X_{N \times p} \right)^{-1} X^T_{p \times N}\, D_{N \times m}$
• The error equation is:
  $J(W) = \frac{1}{2N}\, E^T_{m \times N}\, E_{N \times m}, \qquad E = D - XW$
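A short MATLAB sketch of this off-line solution for a single output, under the assumption that X and D are built as on the previous slide:

% Off-line MSE solution for a single linear neuron
% X: N-by-p input matrix, D: N-by-1 vector of targets
W = (X' * X) \ (X' * D);            % W = (X'X)^(-1) X'D
E = D - X * W;                      % N-by-1 error vector
J = (E' * E) / (2 * length(D));     % performance index J(W) = E'E/(2N)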

Lecture 4-40
• For a single neuron (m = 1):
  $J(W) = \frac{1}{2N}\, E^T_{1 \times N}\, E_{N \times 1}$
• Replacing the error $E = D - XW$ in the equation:
  $J(W) = \frac{1}{2N} \left[ (D - XW)^T (D - XW) \right]$
  $\;\;\;\;= \frac{1}{2N} \left[ (D^T - W^T X^T)(D - XW) \right]$
  $\;\;\;\;= \frac{1}{2N} \left[ D^T D - D^T X W - W^T X^T D + W^T X^T X W \right]$
  $\;\;\;\;= \frac{1}{2N} \left[ D^T D - 2\, D^T X W + W^T X^T X W \right]$
Lecture 4-41
• Example 1 (a single linear neuron with one input and a bias):
  $W = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}, \qquad X = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \\ 5 & 1 \\ 6 & 1 \\ 7 & 1 \\ 8 & 1 \\ 9 & 1 \end{bmatrix}, \qquad D = \begin{bmatrix} 1.1 \\ 1.8 \\ 3.2 \\ 4.1 \\ 4.8 \\ 5.7 \\ 7.3 \\ 7.9 \\ 9.2 \\ 9.9 \end{bmatrix}$

  $J(w_1, w_2) = \frac{1}{20} \left[ 385.38 - 2 \begin{bmatrix} 330 & 55 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 285 & 45 \\ 45 & 10 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \right]$

  $J(w_1, w_2) = \frac{1}{20} \left[ 385.38 - 660\, w_1 - 110\, w_2 + 285\, w_1^2 + 90\, w_1 w_2 + 10\, w_2^2 \right]$
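This example can be reproduced with a few lines of MATLAB (a sketch, not the lecture's M_file):

% MSE solution for Example 1 (one input plus a bias column)
x = (0:9)';
D = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9]';
X = [x ones(10,1)];                 % second column is the bias input
W = (X' * X) \ (X' * D);            % works out to w1 = w2 = 1 for this data
J = (D - X*W)' * (D - X*W) / 20;    % value of the performance index at W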

Lecture 4-42
The plot of the performance index J(w1, w2) for the example

Lecture 4-43
• Example 2: the performance index in the general case
  (the figure indicates the values w1 = 3.19 and w2 = 8.24)

Lecture 4-44
Method of steepest
descent
• If N is large, the computational cost of this direct calculation becomes high.
• To avoid this problem, we can find the optimal weight vector, for which the mean-squared error J(W) attains its minimum, by iteratively modifying the weight vector for each training exemplar in the direction opposite to the gradient of the performance index J(W), as illustrated in Figure 4-5 for a single-weight situation.

Lecture 4-45
Illustration of the steepest
descent method

Lecture 4-46
• When the weight vector attains the optimal value, for which the gradient is zero (w0 in Figure 4-5), the iterations are stopped. More precisely, the iterations are specified as
  $W(n+1) = W(n) + \Delta W(n)$
• where the weight adjustment ΔW(n) is proportional to the (negative) gradient of the mean-squared error:
  $\Delta W(n) = -\eta\, \nabla J(W(n))$
• where η is a learning gain.

Lecture 4-47
• The gradient of the performance index
  $J(W) = \frac{1}{2N} \left[ D^T D - 2\, D^T X W + W^T X^T X W \right]$
  is
  $\nabla J(W) = \frac{\partial J(W)}{\partial W} = \frac{1}{2N} \left[ -2\, X^T D + 2\, X^T X W \right] = \frac{1}{N} \left[ -X^T D + X^T X W \right]$
  where
  $Q = X^T D$ : cross-correlation vector
  $R = X^T X$ : input correlation matrix
• The second derivative of J, known as the Hessian matrix, is
  $H(W) = \frac{\partial^2 J}{\partial W^2} = \nabla\!\left( \nabla J(W) \right) = \frac{1}{N}\, R$
• The gradient must be calculated at each iteration, n = 1 : N.
• This is computationally demanding, but previous calculations can be reused in the next ones (a recursive scheme); a batch sketch is given below.
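A batch steepest-descent sketch in MATLAB, assuming X and D are defined as in the earlier slides; the learning gain and iteration count are illustrative and must be small enough for convergence:

% Batch steepest descent on J(W)
N   = size(X, 1);
Q   = X' * D;                     % cross-correlation
R   = X' * X;                     % input correlation matrix
W   = zeros(size(X, 2), 1);
eta = 0.01;
for n = 1:200
    grad = (R * W - Q) / N;       % gradient of the mean-squared error
    W    = W - eta * grad;        % move against the gradient
end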
Lecture 4-48
The LMS (Widrow- Hoff)
Learning Law
• The Least-Mean-Square learning law replaces the gradient of the mean-squared error with its instantaneous, per-pattern estimate, so that the weight update can be written in the following form:
  $\Delta W_{p \times m}(n) = \eta\, x_{p \times 1}(n)\, \varepsilon^T_{1 \times m}(n)$
  with
  $\varepsilon_i = d_i - y_i, \quad i = 1, \ldots, m \qquad (\varepsilon_{m \times 1} = d_{m \times 1} - y_{m \times 1})$
  so that
  $W(n+1) = W(n) + \eta\, x(n)\, \varepsilon^T(n)$

Lecture 4-49
• For a single neuron:
  • Linear neuron:
    $y = \sum_i w_i x_i, \qquad \varepsilon = d - y$
    $J = \frac{1}{2} \varepsilon^2 = \frac{1}{2} \left( d - \sum_i w_i x_i \right)^2$
    $\frac{\partial J}{\partial w_i} = -\left( d - \sum_i w_i x_i \right) x_i$
    $\Delta w_i = -\eta\, \frac{\partial J}{\partial w_i} = \eta \left( d - \sum_i w_i x_i \right) x_i$
  • Nonlinear neuron:
    $v = \sum_i w_i x_i, \qquad y = \varphi(v), \qquad \varepsilon = d - y$
    $J = \frac{1}{2} \varepsilon^2 = \frac{1}{2} \left( d - \varphi(v) \right)^2$
    $\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial w_i} = -\left( d - \varphi(v) \right) \varphi'(v)\, x_i$
    $\Delta w_i = -\eta\, \frac{\partial J}{\partial w_i} = \eta \left( d - \varphi(v) \right) \varphi'(v)\, x_i$
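An LMS training loop for a single linear neuron, as a MATLAB sketch (X and D as in the MSE slides; the learning rate and number of epochs are illustrative):

% LMS (Widrow-Hoff) training of a single linear neuron
eta = 0.01;
W   = zeros(size(X, 2), 1);
for epoch = 1:50                    % several passes over the training set
    for n = 1:size(X, 1)
        x   = X(n, :)';
        y   = W' * x;               % linear neuron output
        err = D(n) - y;             % epsilon = d - y
        W   = W + eta * err * x;    % W(n+1) = W(n) + eta*epsilon*x
    end
end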

Lecture 4-50
Network training
• Two types of network training:
  • Sequential mode, or incremental (on-line, stochastic, per-pattern): the weights are updated after each pattern is presented.
  • Batch mode (off-line, per-epoch): the weights are updated after all patterns have been presented.

Lecture 4-51
Some general comments
on the learning process
• Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.
• The convergence process can be monitored with the plot of the mean-squared error function J(W(n)).
• Popular stopping criteria are:
  • the mean-squared error is sufficiently small:
    $J(W(n)) < \epsilon$
  • the rate of change of the mean-squared error is sufficiently small:
    $\left| \dfrac{\Delta J(W(n))}{\Delta n} \right| < \epsilon$

Lecture 4-52
The effect of the learning rate η

Lecture 4-53
Applications (1)
• MA (moving-average) modeling (filtering):
  $y(n) = \sum_{i=0}^{M} b_i\, x(n-i), \qquad M: \text{order of the model}$
  $y: [\, y(0)\;\; y(1)\;\; y(2)\; \ldots\; y(N) \,], \qquad x: [\, x(0)\;\; x(1)\;\; x(2)\; \ldots\; x(N) \,]$
• For M = 2:
  $w = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}, \qquad X = \begin{bmatrix} x_2 & x_1 & x_0 \\ x_3 & x_2 & x_1 \\ \vdots & \vdots & \vdots \\ x_N & x_{N-1} & x_{N-2} \end{bmatrix}_{(N-1) \times 3}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}_{(N-1) \times 1}$

Lecture 4-54
Applications (2)
• AR (autoregressive) modeling:
  $y(n) = \sum_{i=1}^{N} a_i\, y(n-i) + b\, x(n), \qquad N: \text{order of the model}$
• For N = 2:
  $w = \begin{bmatrix} a_1 \\ a_2 \\ b \end{bmatrix}, \qquad X = \begin{bmatrix} y_1 & y_0 & x_2 \\ y_2 & y_1 & x_3 \\ \vdots & \vdots & \vdots \\ y_{N-1} & y_{N-2} & x_N \end{bmatrix}_{(N-1) \times 3}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}_{(N-1) \times 1}$

Lecture 4-55
Applications (3)
• PID controller
[Block diagram: a reference model provides the desired output; P, I, and D error terms feed an ADALINE that generates the plant input; an error for learning and an error for control are formed from the plant output]

Lecture 4-56
Simulation of MA modeling
• Suppose the MA model is
  $b = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad M = 2$
• The input is Gaussian noise with mean = 0 and var = 1.
• y is calculated by the recursive equation.
• Please see the M_file.
[Figure: x → unknown system (black box) → y]
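Since the M_file itself is not reproduced in this transcript, here is a minimal MATLAB sketch of the experiment, assuming LMS identification of the three MA coefficients (the data length, learning rate, and use of filter are illustrative assumptions):

% MA-model identification with an ADALINE trained by LMS
b   = [1; 2; 3];                      % unknown MA coefficients, M = 2
N   = 20;                             % length of the training record
x   = randn(N, 1);                    % Gaussian input, mean 0, var 1
y   = filter(b, 1, x);                % output of the unknown system
eta = 0.01;
w   = zeros(3, 1);                    % ADALINE weights (estimates of b)
for n = 3:N                           % per-sample LMS updates
    xv  = [x(n); x(n-1); x(n-2)];     % current input regressor
    err = y(n) - w' * xv;
    w   = w + eta * err * xv;         % each update moves w toward b
end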

Lecture 4-57
M_file of MA Modeling

Lecture 4-58
MA Modeling
• Initial weights: zeros, η = 0.01
• Training data set: N = 20

Lecture 4-59
MA Modeling
• Initial weights: zeros, η = 0.1
• Training data set: N = 20

Lecture 4-60
MA Modeling
• Initial weights: random, η = 0.01
• Training data set: N = 20

Lecture 4-61
MA Modeling
• Initial weights: random, η = 0.1
• Training data set: N = 20

Lecture 4-62
MA Modeling
• Initial weights: random, η = 0.1
• Training data set: N = 10

Lecture 4-63
MATLAB TOOLBOX
• net = newlin(PR,S,ID,LR)
• Description of the function
  • Linear layers are often used as adaptive filters for signal processing and prediction.
  • NEWLIN(PR,S,ID,LR) takes these arguments:
    • PR - Rx2 matrix of min and max values for R input elements.
    • S - Number of elements in the output vector.
    • ID - Input delay vector, default = [0].
    • LR - Learning rate, default = 0.01;
  and returns a new linear layer.
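A hedged usage sketch for the MA-modeling setup above, assuming the legacy toolbox API described on this slide (the input range, delay vector, and the use of adapt/train are assumptions):

% Adaptive linear layer with a tapped delay line (delays 0, 1, 2)
net = newlin([-3 3], 1, [0 1 2], 0.01);   % one output, LR = 0.01
% The network can then be adapted or trained on input/target sequences,
% e.g. with adapt(net, X, T) or train(net, X, T) from the same toolbox.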

Lecture 4-64
• The linear network is shown as:

Lecture 4-65
