Handout: Delta Rule
w1 x1 + w2 x2 = θ
x2 = -(w1/w2) x1 + (θ/w2)
Substituting w1 = w2 = 0.25 and θ = 0.25, we obtain
x2 = -(0.25/0.25) x1 + (0.25/0.25)
x2 = -x1 + 1
i.e.,
x1 x2
0 1
1 0
w1 = w2 = -0.25
θ = +0.25
x2 = -(w1/w2) x1 + (θ/w2)
x2 = -x1 - 1
i.e.,
x1 x2
0 -1
1 -2
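As a quick check, a minimal Python sketch that returns the slope and intercept of the TLU decision line for any (w1, w2, θ):

# Decision line of a two-input TLU: w1*x1 + w2*x2 = theta,
# rewritten as x2 = -(w1/w2)*x1 + theta/w2.
def decision_line(w1, w2, theta):
    slope = -w1 / w2
    intercept = theta / w2
    return slope, intercept

# First weight set:  slope -1, intercept +1  ->  x2 = -x1 + 1
print(decision_line(0.25, 0.25, 0.25))
# Second weight set: slope -1, intercept -1  ->  x2 = -x1 - 1
print(decision_line(-0.25, -0.25, 0.25))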
Second Epoch:
w1     w2     θ      x1  x2   a      y   t   α(t-y)        Δw1   Δw2   Δθ
-0.25  -0.25   0.25   0   0    0      0   1   0.5(1-0)=0.5  0     0     -0.5
-0.25  -0.25  -0.25   0   1   -0.25   1   1   0             0     0      0
-0.25  -0.25  -0.25   1   0   -0.25   1   1   0             0     0      0
-0.25  -0.25  -0.25   1   1   -0.5    0   0   0             0     0      0
w1 = w2 = -0.25
θ = -0.25
x2 = -(w1/w2) x1 + (θ/w2)
x2 = -x1 + 1
i.e.,
x1 x2
0 1
1 0
Third Epoch:
w1     w2     θ      x1  x2   a      y   t   α(t-y)   Δw1   Δw2   Δθ
-0.25  -0.25  -0.25   0   0    0      1   1   0        0     0     0
-0.25  -0.25  -0.25   0   1   -0.25   1   1   0        0     0     0
-0.25  -0.25  -0.25   1   0   -0.25   1   1   0        0     0     0
-0.25  -0.25  -0.25   1   1   -0.5    0   0   0        0     0     0
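A minimal sketch (assuming the perceptron-style updates Δwi = α(t-y)xi and Δθ = -α(t-y) with α = 0.5, as used in the tables above) that replays these epochs:

# Perceptron-rule training of a two-input TLU, replaying the epochs above.
# Target function read off the tables: output 1 for every pattern except (1, 1).
patterns = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w1, w2, theta = -0.25, -0.25, 0.25   # values at the start of the second epoch
alpha = 0.5                           # learning rate used in the tables

for epoch in range(2):                # second and third epochs
    for (x1, x2), t in patterns:
        a = w1 * x1 + w2 * x2                     # activation
        y = 1 if a >= theta else 0                # TLU output
        delta = alpha * (t - y)
        w1, w2, theta = w1 + delta * x1, w2 + delta * x2, theta - delta
        print(w1, w2, theta, x1, x2, a, y, t)     # matches the table rows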
1. Capability to train all the weights in multilayer nets with no a priori knowledge of the training
set.
2. Based on defining a measure of the difference between the actual network output and target
vector.
Consider a quantity y that depends on a single variable x, i.e., y = y(x).
Slope of a Function
The slope of y at a point P is the derivative dy/dx evaluated at x = P.
If Δx is small enough, the change Δy ≈ δy, where
δy = (dy/dx) Δx.
Now choose Δx = -α (dy/dx), where α is small and positive. Substituting, we obtain
δy = -α (dy/dx)²    ( ** )
The quantity (dy/dx)² is positive.
Hence, the quantity -α (dy/dx)² must be negative, and so
Δy < 0,
i.e., we have "traveled down" the curve towards the minimal point.
If we keep repeating steps such as ( ** ), then we should approach the value x0 associated with the
function minimum.
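A minimal numerical sketch of repeating step ( ** ), assuming an illustrative function y(x) = (x - 2)² whose minimum lies at x0 = 2:

# 1-D gradient descent: repeatedly step by delta_x = -alpha * dy/dx.
def dydx(x):
    return 2.0 * (x - 2.0)    # derivative of the illustrative y(x) = (x - 2)**2

x = 0.0                        # arbitrary starting point
alpha = 0.1                    # small, positive learning rate
for _ in range(100):
    x += -alpha * dydx(x)      # the step ( ** )
print(x)                       # approaches the minimum at x0 = 2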
Now suppose y depends on several variables x1, x2, ..., xn. One may speak of the slope of the function, or its rate of change, with respect to each of these variables independently.
Δxi = -α (∂y/∂xi).
There is an equation like this for each variable, and all of them must be used to ensure that δy < 0 and that there is gradient descent.
Applying the same idea to the network error E, regarded as a function of the weights, gives
Δwi = -α (∂E/∂wi).
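The same recipe in more than one variable; a sketch assuming an illustrative function y(x1, x2) = (x1 - 1)² + (x2 + 3)², with the partial derivatives estimated by finite differences:

# Gradient descent over several variables: delta x_i = -alpha * dy/dx_i for every i.
def y(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 3.0) ** 2

def partial(f, x, i, h=1e-6):
    # finite-difference estimate of the partial derivative of f with respect to x[i]
    xp = list(x)
    xp[i] += h
    return (f(xp) - f(x)) / h

x = [0.0, 0.0]
alpha = 0.1
for _ in range(200):
    grads = [partial(y, x, i) for i in range(len(x))]
    x = [xi - alpha * g for xi, g in zip(x, grads)]
print(x)    # approaches the minimum at (1, -3)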
Suppose we assign equal importance to the error for each pattern, so that if e^p is the error for training pattern p, then the total error E is just the average or mean over all N patterns.
One attempt is to define e^p simply as the difference e^p = t^p - y^p, where y^p is the TLU output in response to pattern p.
However, the error is then smaller for t^p = 0, y^p = 1 than for t^p = 1, y^p = 0, even though both outputs are equally wrong.
We next try
e^p = (t^p - y^p)².
However, y is a discontinuous (step) function of the activation and so gives no useful gradient information with respect to the weights. One remedy is to use the activation instead of the output:
e^p = (t^p - a^p)².
When using the augmented weight vector, the output changes as the activation changes sign, i.e.,
a ≥ 0 ⇒ y = 1;  a < 0 ⇒ y = 0.
As long as the activation takes on the correct sign, the target output is guaranteed, and we are free to choose two arbitrary numbers, one positive and one negative, as the activation targets.
e^p = 1/2 (t^p - a^p)²
The error E depends on all the patterns, and thus so do all its derivatives. Hence, the whole training set needs to be presented in order to evaluate the gradients ∂E/∂wi.
This is batch training: it results in true gradient descent, but is computationally intensive.
Instead ... adapt the weights based on the presentation of each pattern individually.
Recall that:
e^p = 1/2 (t^p - a^p)²
and
a^p = w1 x1^p + w2 x2^p + ... + wn+1 xn+1^p
1. The gradient must depend in some way on (t^p - a^p): the larger this is, the larger we expect the gradient to be.
If this difference is zero, then the gradient is also zero, since we have then found the minimum value of e^p.
2. The gradient must depend on the input xi^p, for if this is zero, then the ith input is making no contribution to the activation for the pth pattern and cannot affect the error: no matter how wi changes, it makes no difference to e^p.
Conversely, if xi^p is large, then e^p is correspondingly sensitive to the value of wi.
∂e^p/∂wi = -(t^p - a^p) xi^p
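A quick numerical check of this derivative; a sketch using arbitrary illustrative values for the weights, augmented input, and activation target:

# Verify d(e^p)/d(wi) = -(t^p - a^p) * xi^p by finite differences.
x = [1.0, 0.5, -1.0]     # illustrative augmented input pattern (last component is -1)
t = 1.0                   # illustrative activation target

def error(w):
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (t - a) ** 2

w = [0.2, -0.1, 0.4]      # illustrative weights (the last plays the role of the threshold)
a = sum(wi * xi for wi, xi in zip(w, x))
for i in range(len(w)):
    analytic = -(t - a) * x[i]
    wp = list(w)
    wp[i] += 1e-6
    numeric = (error(wp) - error(w)) / 1e-6
    print(i, analytic, numeric)   # the two values agree for each weight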
Using ∂e^p/∂wi as an estimate of the true gradient ∂E/∂wi in
Δwi = -α (∂E/∂wi),
we obtain
Δwi = α (t^p - a^p) xi^p
Pattern Training Regime: weight changes are made after each vector presentation.
Because we are using estimates of the true gradient, the progress in the minimization of E is noisy; i.e., weight changes are sometimes made which increase E.
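The two regimes can be contrasted in a short sketch, here using the quadratic per-pattern error above on illustrative data (the AND patterns with activation targets -1/+1, as in the worked example below):

# Batch vs. pattern training for e^p = 1/2 (t^p - a^p)^2 with augmented inputs (x1, x2, -1).
patterns = [((0.0, 0.0, -1.0), -1.0), ((0.0, 1.0, -1.0), -1.0),
            ((1.0, 0.0, -1.0), -1.0), ((1.0, 1.0, -1.0), 1.0)]
alpha = 0.25

def grad(w, x, t):
    a = sum(wi * xi for wi, xi in zip(w, x))
    return [-(t - a) * xi for xi in x]        # d(e^p)/d(wi) for each weight

# Batch training: average the gradients over the whole set, then update once per epoch.
w = [0.0, 0.4, 0.3]
for _ in range(200):
    g = [0.0, 0.0, 0.0]
    for x, t in patterns:
        g = [gi + gpi for gi, gpi in zip(g, grad(w, x, t))]
    w = [wi - alpha * gi / len(patterns) for wi, gi in zip(w, g)]

# Pattern training: update immediately after each pattern presentation (noisier).
v = [0.0, 0.4, 0.3]
for _ in range(200):
    for x, t in patterns:
        v = [vi - alpha * gi for vi, gi in zip(v, grad(v, x, t))]

print(w, v)   # both settle near the same minimum of E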
This is the Widrow-Hoff rule, now referred to as the Delta Rule (or δ-rule).
Widrow and Hoff first proposed this training regime (1960). They trained ADALINEs (ADAptive LINear ElementS), which are TLUs except that the input and output signals are bipolar (i.e., {-1, 1}).
If the learning rate is sufficiently small, then the delta rule converges.
I.e., the weight vector approaches the vector w0, for which the error is a minimum, and E itself
approaches a constant value.
Note: A solution will not exist if the problem is not linearly separable.
Then w0 is the best the TLU can do, and some patterns will be incorrectly classified.
Also note that the delta rule will always make changes to the weights, no matter how small, because the target activation values ±1 will never be attained exactly.
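The convergence behaviour and the last note can be seen in a small sketch, assuming delta-rule training with activation targets ±1 on the AND data of the example below: the mean error settles towards a constant, while the per-step weight changes never become exactly zero.

# Delta rule on AND with activation targets -1/+1: E levels off, updates stay nonzero.
patterns = [((0, 0), -1.0), ((0, 1), -1.0), ((1, 0), -1.0), ((1, 1), 1.0)]
w1, w2, theta = 0.0, 0.4, 0.3
alpha = 0.25
for epoch in range(1, 201):
    E = 0.0
    for (x1, x2), t in patterns:
        a = w1 * x1 + w2 * x2 - theta               # augmented activation
        delta = alpha * (t - a)
        w1, w2, theta = w1 + delta * x1, w2 + delta * x2, theta - delta
        E += 0.5 * (t - a) ** 2
    if epoch % 50 == 0:
        print(epoch, E / len(patterns), delta)       # E near-constant; delta still nonzero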
Train a two-input TLU with initial weights (0, 0.4) and threshold 0.3, using the delta rule with a learning rate α = 0.25.
(The AND function)
First Epoch
(Here a = w1 x1 + w2 x2 - θ, using the augmented weight vector, and t is the activation target ±1.)
w1     w2     θ      x1  x2   a      t      α(t-a)   Δw1    Δw2    Δθ
 0.00   0.40   0.30   0   0   -0.30  -1.00  -0.17    -0.00  -0.00   0.17
 0.00   0.40   0.48   0   1   -0.08  -1.00  -0.23    -0.00  -0.23   0.23
 0.00   0.17   0.71   1   0   -0.71  -1.00  -0.07    -0.07  -0.00   0.07
-0.07   0.17   0.78   1   1   -0.68   1.00   0.42     0.42   0.42  -0.42
We employ Δwi = +α (t^p - a^p) xi^p.
Note the plus sign before α: we always travel in the direction opposite to the gradient.
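A sketch that replays the first epoch above (assuming, as read from the table, the augmented activation a = w1 x1 + w2 x2 - θ and activation targets -1/+1 for the AND function):

# Delta-rule training: first epoch of the AND example.
patterns = [((0, 0), -1.0), ((0, 1), -1.0), ((1, 0), -1.0), ((1, 1), 1.0)]
w1, w2, theta = 0.0, 0.4, 0.3
alpha = 0.25
for (x1, x2), t in patterns:
    a = w1 * x1 + w2 * x2 - theta                  # augmented activation
    delta = alpha * (t - a)                         # α(t - a)
    dw1, dw2, dtheta = delta * x1, delta * x2, -delta
    print(w1, w2, theta, x1, x2, round(a, 2), t, round(delta, 2))
    w1, w2, theta = w1 + dw1, w2 + dw2, theta + dtheta
# The epoch ends with (w1, w2, theta) close to (0.35, 0.59, 0.36), the values that
# begin the second epoch.  (The handout table rounds to two decimals at each step,
# so its intermediate entries may differ slightly from the values printed here.)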
Second Epoch
w1     w2     θ      x1  x2   a   t   α(t-a)   Δw1   Δw2   Δθ
0.35   0.59   0.36   0   0
                     0   1
                     1   0
                     1   1