Lecture 4
Lecture 4-2
Introduction
Lecture 4-3
Perceptron (1)
Goal
Classifying the applied inputs x1, x2, ..., xm into one of two classes.
Procedure
If the output of the hard limiter is +1, the input is assigned to class C1; if it is -1, to class C2.
Input of the hard limiter: the weighted sum of the inputs,
v = \sum_{i=1}^{m} w_i x_i + \theta
[Figure: perceptron with inputs x_1, x_2, x_3, weights w_1, w_2, w_3, bias \theta, and hard limiter \varphi(\cdot).]
The effect of the bias b is merely to shift the decision boundary \sum_{i=1}^{m} w_i x_i + b = 0 away from the origin.
Lecture 4-5
Selection of weights for the
Perceptron
In general, two basic methods can be employed to
select a suitable weight vector:
By off-line calculation of weights.
If the problem is relatively simple, it is often possible to
calculate the weight vector from the specification of the
problem.
By a learning procedure.
The weight vector is determined from a given (training) set
of input-output vectors (exemplars) in such a way as to
achieve the best classification of the training vectors.
Lecture 4-6
Perceptron Learning Theorem (1)
Linearly separable:
If two classes are linearly separable, there exists a decision
surface consisting of a hyperplane.
If so, there exists a weight vector w such that w^T x > 0 for every x in class C1 and w^T x \le 0 for every x in class C2.
Lecture 4-7
Perceptron Learning Theorem (2)
Using the modified signal-flow graph:
The bias \theta(n) is treated as a synaptic weight driven by a fixed input x_0 = +1, so that w_0(n) = \theta(n).
Linear combiner output:
v(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)
[Figure: signal-flow graph with inputs x_0 = +1, x_1, ..., x_m, weights w_0 = \theta_k, w_1, ..., w_m, summing node v_k, hard limiter \varphi(v), and output y_k.]
Lecture 4-8
Perceptron Learning Theorem (3)
Weight adjustment
If x(n) is correctly classified, the weights are not changed:
w(n+1) = w(n)  if w^T(n) x(n) > 0 and x(n) belongs to class C1
w(n+1) = w(n)  if w^T(n) x(n) \le 0 and x(n) belongs to class C2
Otherwise (x(n) is misclassified), the weight vector is updated:
w(n+1) = w(n) - \eta(n) x(n)  if w^T(n) x(n) > 0 and x(n) belongs to class C2
w(n+1) = w(n) + \eta(n) x(n)  if w^T(n) x(n) \le 0 and x(n) belongs to class C1
Lecture 4-9
Summary of Learning
1. Initialization: set w(0) = 0.
2. Activation: at time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
3. Computation of the actual response:
y(n) = \mathrm{sgn}[w^T(n) x(n)]
4. Adaptation of the weight vector:
w(n+1) = w(n) + \eta [d(n) - y(n)] x(n)
where e(n) = d(n) - y(n) is the error and
d(n) = +1 if x(n) belongs to class C1, d(n) = -1 if x(n) belongs to class C2.
5. Continuation: increment time step n and go back to step 2 (a MATLAB sketch of these steps follows).
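A minimal MATLAB sketch of steps 1 to 5 above; the training data, learning rate, and variable names are illustrative assumptions, not taken from the slides:

% Perceptron training loop following steps 1-5.
% X: N-by-(m+1) input matrix (first column fixed at +1 for the bias),
% d: N-by-1 desired responses (+1 for class C1, -1 for class C2).
X   = [1  2  2; 1  1  3; 1 -1 -2; 1 -2 -1];   % example data, bias input prepended
d   = [1; 1; -1; -1];
eta = 1;                                      % learning-rate parameter
w   = zeros(size(X, 2), 1);                   % step 1: w(0) = 0
for epoch = 1:100
    errors = 0;
    for n = 1:size(X, 1)                      % step 2: apply x(n) and d(n)
        x = X(n, :)';
        y = sign(w' * x);                     % step 3: actual response
        if y == 0, y = -1; end                % treat sign(0) as class C2
        w = w + eta * (d(n) - y) * x;         % step 4: adapt the weight vector
        errors = errors + (y ~= d(n));
    end
    if errors == 0, break; end                % step 5: stop once an epoch is error-free
end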
Lecture 4-10
The network is capable of solving linearly separable problems:
\sum_{i=1}^{m} w_i x_i > 0  (input assigned to class C1)
\sum_{i=1}^{m} w_i x_i = 0  (decision boundary)
\sum_{i=1}^{m} w_i x_i < 0  (input assigned to class C2)
Lecture 4-11
Learning rule
An algorithm to update the weights w so that, finally, the input patterns of the two classes lie on opposite sides of the line decided by the perceptron.
Slides 4-11 to 4-14 illustrate successive updates, showing the decision boundary
w(0)^T x = 0, w(1)^T x = 0, w(2)^T x = 0, w(3)^T x = 0
after each weight adjustment.
Lecture 4-15
Implementation of Logical NOT, AND,
and OR
Lecture 4-16
Implementation of Logical Gate
Lecture 4-17
Finding Weights Analytically for the
AND Network: off-line
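The figure for this slide is not reproduced here. As one illustrative choice, consistent with the MSE result on slide 4-18, the AND gate can be realized with w1 = w2 = 1 and threshold 1.5; a quick MATLAB check (values assumed, not read from the slide):

% Hard-limiter AND gate: output is 1 only when x1 + x2 exceeds the threshold 1.5.
X     = [0 0; 0 1; 1 0; 1 1];   % all input combinations
w     = [1; 1];                 % weights
theta = 1.5;                    % threshold (equivalently, bias = -1.5)
y     = (X * w - theta) > 0;    % true (1) only for the input (1,1)
disp([X y])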
Lecture 4-18
Finding Weights by MSE Method:
off-line
Write an equation for each training data pair.
The output for the first class is +1 and for the second class is -1 (or 0).
Apply the MSE method to solve the problem.
Example: implementation of the AND gate, with decision boundary \sum_{i=1}^{m} w_i x_i + w_0 = 0.
Handling the bias through an augmented input fixed at +1:
X = [0 0 1; 0 1 1; 1 0 1; 1 1 1],  W = [w_1; w_2; w_0],  D = [-1; -1; -1; +1]
Solving X W \approx D in the least-squares sense gives w_1 = 1, w_2 = 1, w_0 = -1.5.
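A minimal MATLAB check of this off-line calculation (variable names are illustrative):

% Least-squares (MSE) weights for the AND gate with an augmented bias input.
X = [0 0 1; 0 1 1; 1 0 1; 1 1 1];   % each row: [x1 x2 1]
D = [-1; -1; -1; 1];                % desired output, +1 only for input (1,1)
W = (X' * X) \ (X' * D);            % normal equations; expected W = [1; 1; -1.5]
disp(W')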
Lecture 4-19
Summary: Perceptron vs. MSE
procedures
Lecture 4-20
Perceptron learning law: the
geometric interpretation
Lecture 4-22
Convergence of the Perceptron
learning law (1)
Fixed-increment convergence theorem:
For linearly separable sets of vectors X1 and X2, the perceptron converges after some n_0 iterations, in the sense that
w(n_0) = w(n_0 + 1) = w(n_0 + 2) = \dots
is a solution vector for n_0 \le n_{\max}.
Proof for the case \eta(n) = 1:
Lecture 4-23
Convergence of the Perceptron
learning law (2)
Assume:
The correct weight vector is normalized, |W^*| = 1, and |X| = 1 for every X \in C_1.
There is a small positive fixed number \alpha such that W^* \cdot X \ge \alpha for every X \in C_1.
Define
G(W) = \frac{W^* \cdot W}{|W|} \le 1
G(W) is the cosine of the angle between W and W^*.
Consider the behavior of G(W) through adaptation (step 4, slide 10). When the weights are updated on a misclassified X \in C_1,
W^* \cdot W(n+1) = W^* \cdot (W(n) + X) = W^* \cdot W(n) + W^* \cdot X \ge W^* \cdot W(n) + \alpha
Lecture 4-24
Convergence of the Perceptron
learning law (3)
After the nth application of the adaptation step (starting from W(0) = 0),
W^* \cdot W(n) \ge n\alpha
The denominator of G(W) is
|W(n+1)|^2 = W(n+1) \cdot W(n+1) = (W(n) + X) \cdot (W(n) + X) = |W(n)|^2 + 2\,W(n) \cdot X + |X|^2 \le |W(n)|^2 + 1
since an update occurs only when X is misclassified (W(n) \cdot X \le 0) and |X|^2 = 1. Hence
|W(n)|^2 \le n
and therefore
G(W(n)) = \frac{W^* \cdot W(n)}{|W(n)|} \ge \frac{n\alpha}{\sqrt{n}} = \sqrt{n}\,\alpha, \qquad while G(W) \le 1
Lecture 4-25
Convergence of the Perceptron
learning law (4)
The number of times, n, that we go to the "adaptation step" will still be finite and bounded by
n \le \frac{1}{\alpha^2}
Lecture 4-26
Limitation of Perceptron
The XOR problem (Minsky): nonlinear separability
Lecture 4-27
Perceptron with sigmoid activation
function
For single neuron with step activation function:
Lecture 4-28
Representation of Perceptron in
MATLAB
Lecture 4-29
MATLAB TOOLBOX
net = newp(pr,s,tf,lf)
Description of function
Perceptrons are used to solve simple (i.e. linearly
separable) classification problems.
NET = NEWP(PR,S,TF,LF) takes these inputs,
PR - Rx2 matrix of min and max values for R input
elements.
S - Number of neurons.
TF - Transfer function, default = 'hardlim'.
LF - Learning function, default = 'learnp'.
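A short usage sketch, assuming the (older) Neural Network Toolbox functions newp, train, and sim are available; the data are an illustrative linearly separable example:

% Two-input perceptron trained on four linearly separable points.
P = [0 0 1 1;               % each column is one input vector
     0 1 0 1];
T = [0 0 0 1];              % target labels (hardlim output is 0/1)
net = newp([0 1; 0 1], 1);  % input ranges and one neuron; defaults hardlim/learnp
net = train(net, P, T);     % adjust the weights with the perceptron rule
Y   = sim(net, P);          % classify the training points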
Lecture 4-30
Classification example: linear separability
See the M-file.
Lecture 4-31
[Figures: scatter plot of the data and the learning curve.]
Lecture 4-32
[Figure: scatter plot of the data after training, with the decision boundary.]
Lecture 4-33
Classification of data: nonlinear separability
Lecture 4-34
Classification of data: nonlinear separability
Lecture 4-35
ADALINE — The Adaptive Linear
Element
ADALINE is a perceptron with a linear activation function.
It was proposed by Widrow.
y = \sum_i w_i x_i = X^T W
Lecture 4-36
Applications of Adaline
Lecture 4-37
Error concept
For a single neuron:
e = d - y
For multiple neurons (m is the number of output neurons):
e_i = d_i - y_i, \qquad i = 1, \dots, m
or, in vector form, e_{m \times 1} = d_{m \times 1} - y_{m \times 1}.
The total measure of the goodness of approximation, or the
performance index, can be specified by the mean-squared error over
m neurons and N training vectors:
J(W) = \frac{1}{2mN} \sum_{i=1}^{N} \sum_{j=1}^{m} e_j^2(i)
Lecture 4-38
Input data:
X_{N \times p} = [x_{11} \dots x_{1p}; \ x_{21} \dots x_{2p}; \ \dots; \ x_{N1} \dots x_{Np}]
Lecture 4-39
The MSE solution is:
W_{p \times m} = (X^T_{p \times N} X_{N \times p})^{-1} X^T_{p \times N} D_{N \times m}
with the corresponding performance index
J(W) = \frac{1}{2N} E^T_{m \times N} E_{N \times m}, \qquad E_{N \times m} = D_{N \times m} - X_{N \times p} W_{p \times m}
Lecture 4-40
For a single neuron (m = 1):
J(W) = \frac{1}{2N} E^T_{1 \times N} E_{N \times 1}
J(W) = \frac{1}{2N} (D - XW)^T (D - XW)
     = \frac{1}{2N} (D^T - W^T X^T)(D - XW)
     = \frac{1}{2N} [D^T D - D^T X W - W^T X^T D + W^T X^T X W]
     = \frac{1}{2N} [D^T D - 2 D^T X W + W^T X^T X W]
Lecture 4-41
Example 1: fit the ten data pairs below with d \approx w_1 x + w_2.
X = [0 1; 1 1; 2 1; 3 1; 4 1; 5 1; 6 1; 7 1; 8 1; 9 1]_{10 \times 2}, \qquad
W = [w_1; w_2], \qquad
D = [1.1; 1.8; 3.2; 4.1; 4.8; 5.7; 7.3; 7.9; 9.2; 9.9]_{10 \times 1}
J(w_1, w_2) = \frac{1}{20} \big( 385.38 - 2\,[330 \ \ 55]\,[w_1; w_2] + [w_1 \ \ w_2]\,[285 \ 45; 45 \ 10]\,[w_1; w_2] \big)
J(w_1, w_2) = \frac{1}{20} \big( 385.38 - 660 w_1 - 110 w_2 + 285 w_1^2 + 90 w_1 w_2 + 10 w_2^2 \big)
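A quick MATLAB check of Example 1 using the closed-form MSE solution of slide 4-39:

% Linear fit d ~ w1*x + w2 for the ten data pairs of Example 1.
x = (0:9)';
d = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9]';
X = [x ones(10, 1)];                        % second column handles the bias w2
W = (X' * X) \ (X' * d);                    % expected result: approximately w1 = 1, w2 = 1
J = (d - X * W)' * (d - X * W) / (2 * 10);  % value of the performance index at the minimum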
Lecture 4-42
The plot of the performance index J(w_1, w_2) of Example 1
Lecture 4-43
Example 2: the performance index in the general case
w_1 = 3.19, \qquad w_2 = 8.24
Lecture 4-44
Method of steepest
descent
If N is large, the computational effort of the off-line calculation becomes high.
In order to avoid this problem, we can find the optimal
weight vector, for which the mean-squared error J(W)
attains its minimum, by iterative modification of the weight
vector for each training exemplar in the direction
opposite to the gradient of the performance index J(W),
as illustrated in Figure 4-5 for a single-weight situation.
Lecture 4-45
Illustration of the steepest
descent method
Lecture 4-46
When the weight vector attains the optimal value for which the
gradient is zero (w0 in Figure 4–5), the iterations are stopped.
More precisely, the iterations are specified as
W(n+1) = W(n) + \Delta W(n)
Lecture 4-47
The gradient of the performance index:
J(W) = \frac{1}{2N} [D^T D - 2 D^T X W + W^T X^T X W]
\nabla J(W) = \frac{\partial J(W)}{\partial W} = \frac{1}{2N} [-2 X^T D + 2 X^T X W] = \frac{1}{N} [X^T X W - X^T D]
Q \equiv D^T X : cross-correlation
R \equiv X^T X : input correlation
The second derivative of J, which is known as the Hessian matrix:
H(W) = \frac{\partial^2 J}{\partial W^2} = \frac{\partial}{\partial W} (\nabla J(W)) = \frac{1}{N} R
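A minimal sketch of batch steepest descent using this gradient, W(n+1) = W(n) - eta * grad J(W(n)), reusing the data of Example 1; the step size is an illustrative assumption:

% Batch steepest descent on J(W) = 1/(2N) * (D - X*W)' * (D - X*W).
x   = (0:9)';
D   = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9]';
X   = [x ones(10, 1)];
N   = size(X, 1);
W   = zeros(2, 1);
eta = 0.02;                               % small enough for stable convergence here
for n = 1:2000
    grad = (X' * X * W - X' * D) / N;     % gradient of the performance index
    W    = W - eta * grad;                % step against the gradient
end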
Lecture 4-49
For a single neuron, two cases:
Linear activation, y = v:
e = d - y = d - \sum_i w_i x_i
J = \frac{1}{2} e^2 = \frac{1}{2} \big(d - \sum_i w_i x_i\big)^2
\frac{\partial J}{\partial w_i} = -\big(d - \sum_i w_i x_i\big) x_i
\Delta w_i = -\eta \frac{\partial J}{\partial w_i} = \eta \big(d - \sum_i w_i x_i\big) x_i
General (e.g. sigmoid) activation, y = \varphi(v):
e = d - \varphi(v)
J = \frac{1}{2} e^2 = \frac{1}{2} \big(d - \varphi(v)\big)^2
\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial w_i} = -\big(d - \varphi(v)\big) \varphi'(v) x_i
\Delta w_i = -\eta \frac{\partial J}{\partial w_i} = \eta \big(d - \varphi(v)\big) \varphi'(v) x_i
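A per-sample sketch of the resulting delta rule; the activation phi(v) = tanh(v), the data, and the learning rate are illustrative assumptions (the slide leaves phi generic):

% Online delta rule for one neuron with activation phi(v) = tanh(v).
X   = [0.5 1.0; -1.0 0.5; 1.0 -0.5; -0.5 -1.0];   % illustrative inputs (one per row)
d   = [0.8; -0.8; 0.8; -0.8];                     % illustrative targets in (-1, 1)
w   = zeros(2, 1);
eta = 0.1;
for epoch = 1:200
    for n = 1:size(X, 1)
        x = X(n, :)';
        v = w' * x;                        % induced local field
        y = tanh(v);                       % phi(v)
        e = d(n) - y;                      % error
        w = w + eta * e * (1 - y^2) * x;   % delta(w_i) = eta * e * phi'(v) * x_i
    end
end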
Lecture 4-50
Network training
Two types of network training:
Lecture 4-51
Some general comments
on the learning process
Computationally, the learning process goes through all training
examples (an epoch) a number of times, until a stopping criterion is
reached.
The convergence process can be monitored with the plot of the
mean-squared error function J(W(n)).
The popular stopping criteria are:
the mean-squared error is sufficiently small:
J(W(n)) < \varepsilon
the rate of change of the mean-squared error is sufficiently small:
\left| \frac{\Delta J(W(n))}{\Delta n} \right| < \varepsilon
Lecture 4-52
The effect of the learning rate
Lecture 4-53
Applications (1)
MA (Moving average) modeling (filtering)
y(n) = \sum_{i=0}^{M} b_i x(n-i), \qquad M : order of the model
For M = 2:
w = [b_0; b_1; b_2], \qquad
X = [x_2 \ x_1 \ x_0; \ x_3 \ x_2 \ x_1; \ \dots; \ x_N \ x_{N-1} \ x_{N-2}]_{(N-1) \times 3}, \qquad
D = [y_2; y_3; \dots; y_N]_{(N-1) \times 1}
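For concreteness, a sketch of assembling X and D in MATLAB (indices start at 1, so the first usable row is [x(3) x(2) x(1)]); the coefficients used to generate y are borrowed from the simulation on slide 4-56:

% Regression matrix and target vector for an order-2 MA model.
x = randn(20, 1);                      % observed input signal
y = filter([1 2 3], 1, x);             % observed output (coefficients unknown in practice)
M = 2;
N = length(x);
X = [x(M+1:N)  x(M:N-1)  x(M-1:N-2)];  % rows: [x(n) x(n-1) x(n-2)]
D = y(M+1:N);                          % corresponding outputs y(n)
b = (X' * X) \ (X' * D);               % off-line MSE estimate of [b0; b1; b2]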
Lecture 4-54
Applications (2)
AR (autoregressive) modeling:
y(n) = \sum_{i=1}^{N} a_i y(n-i) + b\,x(n), \qquad N : order of the model
For N = 2:
w = [a_1; a_2; b], \qquad
X = [y_1 \ y_0 \ x_2; \ y_2 \ y_1 \ x_3; \ \dots; \ y_{N-1} \ y_{N-2} \ x_N]_{(N-1) \times 3}, \qquad
D = [y_2; y_3; \dots; y_N]_{(N-1) \times 1}
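A corresponding sketch for a second-order AR model; the coefficients used to simulate the data are illustrative assumptions:

% Regression matrix for the AR model y(n) = a1*y(n-1) + a2*y(n-2) + b*x(n).
x = randn(50, 1);
y = filter(0.5, [1 -0.6 -0.2], x);   % simulates a1 = 0.6, a2 = 0.2, b = 0.5
N = length(y);
X = [y(2:N-1)  y(1:N-2)  x(3:N)];    % rows: [y(n-1) y(n-2) x(n)]
D = y(3:N);                          % rows: y(n)
w = (X' * X) \ (X' * D);             % off-line MSE estimate of [a1; a2; b]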
Lecture 4-55
Applications (3)
PID controller
[Block diagram: the error between the desired (model) output and the plant output is split into P, I, and D components, which form the inputs of an ADALINE; the ADALINE output is the plant input. The same error signal serves both as the error for learning and as the error for control.]
Lecture 4-56
Simulation of MA modeling
Suppose the MA model is
b = [1; 2; 3], \qquad M = 2
[Figure: the input x is applied to the unknown system (black box), which produces the output y.]
Lecture 4-57
M_file of MA Modeling
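The actual M-file is not reproduced in these notes; below is a minimal sketch of what such a script might look like, assuming a sample-by-sample LMS (delta rule) update and the settings quoted on slides 4-58 to 4-62:

% LMS identification of the unknown order-2 MA system with b = [1 2 3].
N   = 20;
x   = randn(N, 1);             % training input applied to the black box
y   = filter([1 2 3], 1, x);   % measured output of the unknown system
w   = zeros(3, 1);             % initial weights (cf. slide 4-58)
eta = 0.01;                    % learning rate (cf. slides 4-58 to 4-62)
for epoch = 1:50
    for n = 3:N
        u = [x(n); x(n-1); x(n-2)];   % regressor [x(n) x(n-1) x(n-2)]
        e = y(n) - w' * u;            % output error of the ADALINE
        w = w + eta * e * u;          % LMS weight update
    end
end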
Lecture 4-58
MA Modeling
Initial weights: zeros; learning rate 0.01; training set of N = 20 samples.
Lecture 4-59
MA Modeling
Initial weights: zeros; learning rate 0.1; training set of N = 20 samples.
Lecture 4-60
MA Modeling
Initial weights: random; learning rate 0.01; training set of N = 20 samples.
Lecture 4-61
MA Modeling
Initial weights: random; learning rate 0.1; training set of N = 20 samples.
Lecture 4-62
MA Modeling
Initial weights: random; learning rate 0.1; training set of N = 10 samples.
Lecture 4-63
MATLAB TOOLBOX
net = newlin(PR,S,ID,LR)
Description of function
Linear layers are often used as adaptive filters for signal
processing and prediction.
NEWLIN(PR,S,ID,LR) takes these arguments,
PR - Rx2 matrix of min and max values for R input elements.
S - Number of elements in the output vector.
ID - Input delay vector, default = [0].
LR - Learning rate, default = 0.01;
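A short usage sketch, again assuming the older toolbox interface (newlin, train, sim); the data reuse the MA identification example and the parameter values are illustrative:

% Linear (ADALINE) layer identifying the order-2 MA system via newlin.
x = randn(1, 50);
y = filter([1 2 3], 1, x);                       % desired response
P = [x; [0 x(1:end-1)]; [0 0 x(1:end-2)]];       % rows: x(n), x(n-1), x(n-2)
net = newlin([-3 3; -3 3; -3 3], 1, [0], 0.01);  % input ranges, 1 output, no delay, LR
net.trainParam.epochs = 200;
net = train(net, P, y);
Yhat = sim(net, P);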
Lecture 4-64
The linear network is shown below.
Lecture 4-65