Handout Delta Rule

Lesson 4

Perceptron Learning - An example


A Two-Input NAND

x1  x2  x1 NAND x2
0   0   1
0   1   1
1   0   1
1   1   0

Let w1 = w2 = θ = 0.25 to begin.

The decision boundary is the line along which the activation equals the threshold:

w1x1 + w2x2 = θ

x2 = -(w1 / w2)x1 + (θ / w2)

Substituting, we obtain

x2 = -(0.25 / 0.25)x1 + (0.25 / 0.25)

x2 = -x1 + 1

i.e., the line passes through the points

x1  x2
0   1
1   0
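
As a quick check, here is a minimal Python sketch (variable names are illustrative) that evaluates this initial TLU on the four NAND patterns; it shows that (0, 0) and (1, 1) start out misclassified, which is exactly where the first epoch below makes its weight changes.

# Evaluate the initial TLU (w1 = w2 = theta = 0.25) on the NAND truth table.
# Output rule: y = 1 if w1*x1 + w2*x2 >= theta, else y = 0.
w1 = w2 = theta = 0.25
nand = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
for (x1, x2), t in nand:
    a = w1 * x1 + w2 * x2           # activation (weighted sum)
    y = 1 if a >= theta else 0      # step-function output
    print(x1, x2, y, t, "OK" if y == t else "misclassified")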

The First Epoch (perceptron rule, learning rate α = 0.5):

w1    w2    θ     x1  x2  a    y  t  α(t-y)         Δw1   Δw2   Δθ
.25   .25   .25   0   0   0    0  1  .5(1-0) = .5   0     0     -.5
.25   .25   -.25  0   1   .25  1  1  .5(1-1) = 0    0     0     0
.25   .25   -.25  1   0   .25  1  1  0              0     0     0
.25   .25   -.25  1   1   .5   1  0  .5(0-1) = -.5  -.5   -.5   .5
After the First Epoch

w1 = w2 = -0.25
θ = +0.25

x2 = -(w1 / w2)x1 + (θ / w2)

x2 = -x1 - 1

i.e., the line passes through the points

x1  x2
0   -1
1   -2

Second Epoch:

w1    w2    θ     x1  x2  a     y  t  α(t-y)        Δw1  Δw2  Δθ
-.25  -.25  .25   0   0   0     0  1  .5(1-0) = .5  0    0    -.5
-.25  -.25  -.25  0   1   -.25  1  1  0             0    0    0
-.25  -.25  -.25  1   0   -.25  1  1  0             0    0    0
-.25  -.25  -.25  1   1   -.5   0  0  0             0    0    0

After the Second Epoch

w1 = w2 = -0.25
θ = -0.25

x2 = -(w1 / w2)x1 + (θ / w2)

x2 = -x1 + 1

i.e., the line passes through the points

x1  x2
0   1
1   0
Third Epoch:

w1    w2    θ     x1  x2  a     y  t  α(t-y)  Δw1  Δw2  Δθ
-.25  -.25  -.25  0   0   0     1  1  0       0    0    0
-.25  -.25  -.25  0   1   -.25  1  1  0       0    0    0
-.25  -.25  -.25  1   0   -.25  1  1  0       0    0    0
-.25  -.25  -.25  1   1   -.5   0  0  0       0    0    0

Since there have been no changes, Halt!
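
The whole example can be reproduced with a short Python sketch of the perceptron rule (a minimal sketch; variable names are illustrative). It prints the same weight trajectory as the tables above and halts after the third epoch, when a pass over the patterns produces no changes.

# Perceptron learning of two-input NAND, reproducing the epoch tables above.
# Update rule: dw_i = alpha*(t - y)*x_i and dtheta = -alpha*(t - y).
alpha = 0.5
w1 = w2 = theta = 0.25
patterns = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

epoch, changed = 0, True
while changed:
    changed = False
    epoch += 1
    for (x1, x2), t in patterns:
        a = w1 * x1 + w2 * x2
        y = 1 if a >= theta else 0
        if y != t:                       # nonzero update only on errors
            w1 += alpha * (t - y) * x1
            w2 += alpha * (t - y) * x2
            theta -= alpha * (t - y)
            changed = True
    print("after epoch", epoch, ":", w1, w2, theta)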

The Delta Rule


We desire:

1. Capability to train all the weights in multilayer nets with no a priori knowledge of the training set.

2. It should be based on defining a measure of the difference between the actual network output and the target vector.

3. This difference is then treated as an error to be minimized by adjusting the weights.

Finding the Minimum of a Function: Gradient Descent (informed hillclimbing?)

Suppose that a quantity y depends on a single variable x,

i.e., y = y(x).

We wish to find the value x0 which minimizes y,

i.e., y(x0) <= y(x) for all x.

• Let x* be the current best estimate for x0.
• To obtain a better estimate for x0, change x* so as to follow the function downhill.
• We need to know the slope of the function at x*:

Slope of a Function

The slope at any point x is just the slope of a straight line, the tangent, which just grazes the curve at that point.

1. Draw the function on graph paper.
2. Draw the tangent at the point P.
3. Measure the sides Δx and Δy of the resulting slope triangle, or merely calculate the derivative y'(x) at x = P.

If Δx is small enough, the change along the curve and the change along the tangent are approximately equal.

Dividing Δy by Δx, and then multiplying by Δx, leaves Δy unchanged:

Δy = (Δy / Δx) Δx

Furthermore, for small Δx, the ratio Δy/Δx approaches the slope of the tangent.

Hence, we may write Δy ≈ slope × Δx. That is,

Δy = (dy/dx) Δx,    (*)

where dy/dx is the derivative of y with respect to x.
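
The "measure the sides" recipe is just a finite-difference estimate of the derivative. A minimal Python check, with an arbitrarily chosen function and point:

# Finite-difference estimate of the slope of y = x^2 at x = 3.
# For small dx, dy/dx is approximately (y(x + dx) - y(x)) / dx.
def y(x):
    return x * x

x, dx = 3.0, 1e-6
slope_estimate = (y(x + dx) - y(x)) / dx
print(slope_estimate)    # close to the true derivative 2*x = 6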

Suppose we can evaluate the slope or derivative of y, and put

Δx = -α (dy/dx), where α > 0 and is small enough that (*) remains a good approximation.

Then, substituting this in (*), we get

Δy ≈ -α (dy/dx)²    (**)

The quantity (dy/dx)² is positive.
Hence, the quantity -α (dy/dx)² must be negative:

Δy < 0.

i.e., we have "traveled down" the curve towards the minimal point.

If we keep repeating steps such as (**), then we should approach the value x0 associated with the function minimum.

This is Gradient Descent.

Its effectiveness hinges on the ability to calculate, or make estimates of, dy/dx.
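
A minimal Python sketch of this descent loop; the function y = (x - 2)², the learning rate, and the starting point are all arbitrary choices for illustration:

# 1-D gradient descent on y = (x - 2)^2, whose minimum is at x0 = 2.
# Each step applies dx = -alpha * dy/dx, travelling down the curve.
alpha = 0.1
x = 5.0                      # current best estimate x*
for step in range(50):
    slope = 2.0 * (x - 2.0)  # dy/dx evaluated at the current x
    x += -alpha * slope
print(x)                     # approximately 2.0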

Functions of More Than One Variable

Suppose y = y(x1, x2, ..., xn).

One may speak of the slope of the function, or its rate of change, with respect to each of these
variables independently.

The slope or derivative of a function y with respect to the variable xi is the partial derivative ∂y/∂xi.

The equivalent of (*) is then

Δxi = -α (∂y/∂xi).

There is an equation like this for each variable, and all of them must be used to ensure that Δy < 0
and there is gradient descent.
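
The same loop extends directly to several variables: every coordinate is updated from its own partial derivative on each step. A sketch, again with an arbitrary example function y = x1² + 2·x2²:

# Gradient descent on y = x1^2 + 2*x2^2, whose minimum is at the origin.
# Both coordinates are updated on every step, ensuring dy < 0.
alpha = 0.1
x1, x2 = 3.0, -4.0
for step in range(100):
    g1 = 2.0 * x1            # partial derivative dy/dx1
    g2 = 4.0 * x2            # partial derivative dy/dx2
    x1 += -alpha * g1
    x2 += -alpha * g2
print(x1, x2)                # both approach 0.0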

Gradient Descent on an Error.

• Consider a network consisting of a single TLU.

• Assume supervised learning,
i.e., for every input pattern p in the training set there is a corresponding target t^p.

• The augmented weight vector, w, completely characterizes the behavior of the network.

• Any function E that expresses the discrepancy between desired and actual network output may be considered as a function of the weights,
i.e.,

E = E(w1, w2, ..., wn+1).

The optimal weight vector is found by minimizing this function E by gradient descent:

Δwi = -α (∂E/∂wi).

We need to find a suitable error E.

Suppose we assign equal importance to the error for each pattern, so that if e^p is the error for training pattern p, then the total error E is just the average or mean over all N patterns:

E = (1/N) Σp e^p.

One attempt is to define e^p as simply the difference e^p = t^p - y^p, where y^p is the TLU output in response to p.

However, the error is then smaller for t^p = 0, y^p = 1 (error -1) than for t^p = 1, y^p = 0 (error +1), even though the two responses are equally wrong.

We next try

e^p = (t^p - y^p)².

• A subtle problem remains:

Gradient descent assumes that the function to be minimized depends on its variables in a smooth, continuous fashion. The activation a^p is simply the weighted sum of inputs, which is smooth and continuous. But the output y^p depends on a^p via the discontinuous step function.

• One remedy is to define the error on the activation rather than the output:

e^p = (t^p - a^p)².

We must be careful how we define the targets.

We have used {0, 1} heretofore.

When using the augmented weight vector, the output changes as the activation changes sign,
i.e.,

a >= 0 ⇒ y = 1.

• As long as the activation takes on the correct sign, the correct output is guaranteed, and we are free to choose two arbitrary numbers, one positive and one negative, as the activation targets.

{1, -1} are customary.

• One last modification:

A factor of 1/2 is added to the error expression, which simplifies the resulting slope or derivative:

e^p = 1/2 (t^p - a^p)²,

and thus we arrive at

The Delta Rule.

• The error E depends on all the patterns, and so do all its derivatives. Hence, the whole training set needs to be presented in order to evaluate the gradients ∂E/∂wi.

• This is batch training. It results in true gradient descent, but is computationally intensive.

• Instead, we adapt the weights based on the presentation of each pattern individually,

i.e., we present the net with a pattern p,
evaluate ∂e^p/∂wi,
and use this as an estimate of the true gradient ∂E/∂wi.

• Recall that

e^p = 1/2 (t^p - a^p)²

and

a^p = w1 x1^p + w2 x2^p + ... + wn+1 xn+1^p.

Since ∂a^p/∂wi = xi^p, the chain rule gives

∂e^p/∂wi = -(t^p - a^p) xi^p, where xi^p is the ith component of pattern p.

1. The gradient must depend in some way on (t^p - a^p): the larger this is, the larger we expect the gradient to be. If this difference is zero, then the gradient is also zero, since we have then found the minimum value of e^p.

2. The gradient must depend on the input xi^p, for if this is zero, then the ith input makes no contribution to the activation for the pth pattern and cannot affect the error: no matter how wi changes, it makes no difference to e^p. Conversely, if xi^p is large, then the error is correspondingly sensitive to the value of wi.

Using

∂e^p/∂wi = -(t^p - a^p) xi^p

as an estimate of the true gradient in

Δwi = -α (∂E/∂wi),

we obtain

Δwi = α (t^p - a^p) xi^p.
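
This per-pattern update is easy to express directly. A minimal Python sketch under the augmented-vector convention (each input carries a trailing -1 so that the last weight plays the role of the threshold θ; the function name is illustrative):

# One delta-rule update for a single pattern, using augmented vectors:
# x carries a trailing -1, so w[-1] acts as the threshold theta.
def delta_rule_update(w, x, t, alpha):
    a = sum(wi * xi for wi, xi in zip(w, x))    # activation a = w . x
    return [wi + alpha * (t - a) * xi           # dw_i = alpha*(t - a)*x_i
            for wi, xi in zip(w, x)]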

• Pattern Training Regime: weight changes are made after each vector presentation.

• We are using estimates of the true gradient, so progress in the minimization of E is noisy;
i.e., weight changes are sometimes made which increase E.

• This is the Widrow-Hoff rule, now referred to as the Delta Rule (or δ-rule).

• Widrow and Hoff first proposed this training regime (1960). They trained ADALINEs (ADAptive LINear ElementS), which are TLUs except that the input and output signals are bipolar (i.e., {-1, 1}).

• If the learning rate α is sufficiently small, then the delta rule converges,
i.e., the weight vector approaches the vector w0 for which the error is a minimum, and E itself approaches a constant value.

• Note: a solution will not exist if the problem is not linearly separable.

• Then w0 is the best the TLU can do, and some patterns will be incorrectly classified.

• (Note the difference with the Perceptron rule!)

• Also note: the delta rule will always make changes to the weights, no matter how small, because the target activation values ±1 will never be attained exactly.

The Delta Rule Algorithm


Begin
  Repeat
    For each training vector pair (V, t)
      Evaluate the activation a when V is input to the TLU
      Adjust each of the weights: wi ← wi + α(t - a)Vi
    End For
  Until the rate of change of the error is sufficiently small
End
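
A runnable Python sketch of this algorithm, under the same augmented-vector convention as above (a trailing -1 input standing in for the threshold); the stopping tolerance and epoch cap are arbitrary choices:

# Delta-rule (Widrow-Hoff) training of a single TLU.
# Each pattern carries a trailing -1, so w[-1] acts as the threshold theta.
def train_delta_rule(patterns, targets, w, alpha=0.25, tol=1e-4, max_epochs=1000):
    prev_error = float("inf")
    for epoch in range(max_epochs):
        error = 0.0
        for x, t in zip(patterns, targets):
            a = sum(wi * xi for wi, xi in zip(w, x))   # activation
            error += 0.5 * (t - a) ** 2                # e^p = 1/2 (t - a)^2
            w = [wi + alpha * (t - a) * xi             # per-pattern update
                 for wi, xi in zip(w, x)]
        if abs(prev_error - error) < tol:              # error has flattened out
            return w
        prev_error = error
    return w

For the example below, the call would be train_delta_rule([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], [-1, -1, -1, 1], w=[0.0, 0.4, 0.3]).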

The Delta Rule - An Example

Train a two-input TLU on the AND function, with initial weights (0, 0.4), threshold θ = 0.3, and learning rate α = 0.25. The bipolar activation targets are t = -1 for the first three patterns and t = +1 for the pattern (1, 1).

First Epoch

w1     w2    θ     x1  x2  a      t      α(t-a)    Δw1        Δw2        Δθ
 0.00  0.40  0.30  0   0   -0.30  -1.00  -0.17      0.00       0.00       0.17 (1)
 0.00  0.40  0.48  0   1   -0.08  -1.00  -0.23      0.00      -0.23 (2)   0.23 (3)
 0.00  0.17  0.71  1   0   -0.71  -1.00  -0.07     -0.07 (4)   0.00       0.07 (5)
-0.07  0.17  0.78  1   1   -0.68   1.00   0.42      0.42 (6)   0.42 (7)  -0.42 (8)

After the first epoch, w1 = 0.35, w2 = 0.59, θ = 0.36.

We employ Δwi = +α(t^p - a^p) xi^p.
Note the plus sign before α: we always travel in the direction opposite to the gradient.

(1) Δθ = 0.25(-1.00 - (-0.30))(-1)    (-1 is the input attached to θ)

       = -0.25(-0.70) = +0.17.

(2) Δw2 = 0.25(-1.00 - (-0.08)) × 1

        = 0.25(-0.92) = -0.23.    ((3) Δθ has the opposite sign, +0.23.)

(4) Δw1 = 0.25(-1.00 - (-0.71)) × 1

        = 0.25(-0.29) = -0.07.    ((5) Δθ has the opposite sign, +0.07.)

(8) Δθ = 0.25(1.00 - (-0.68))(-1)

       = -0.25(1.68) = -0.42.    ((6), (7): Δw1 and Δw2 have the opposite sign, +0.42.)

Since, after the first epoch, we have

w1 = 0.35,
w2 = 0.59,
θ = 0.36,

the decision boundary is

x2 = -(0.35 / 0.59) x1 + (0.36 / 0.59) = -0.59 x1 + 0.61

i.e., the slope is -0.59.

Second Epoch (to be completed in the same way):

w1    w2    θ     x1  x2  a  t  α(t-a)  Δw1  Δw2  Δθ
0.35  0.59  0.36  0   0
                  0   1
                  1   0
                  1   1
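
The hand calculations above, and the rows of this second-epoch table, can be reproduced with a short Python sketch (variable names are illustrative; rounding to two decimals mirrors the tables, up to floating-point rounding in the last digit):

# Reproduce the worked AND example: w = (0, 0.4), theta = 0.3, alpha = 0.25.
# Bipolar activation targets: t = -1 for every pattern except t(1,1) = +1.
alpha = 0.25
w1, w2, theta = 0.0, 0.4, 0.3
data = [((0, 0), -1.0), ((0, 1), -1.0), ((1, 0), -1.0), ((1, 1), 1.0)]

for epoch in (1, 2):
    print("epoch", epoch)
    for (x1, x2), t in data:
        a = w1 * x1 + w2 * x2 - theta     # activation, threshold folded in
        step = alpha * (t - a)            # the common factor alpha*(t - a)
        print(round(w1, 2), round(w2, 2), round(theta, 2),
              x1, x2, round(a, 2), t, round(step, 2))
        w1 += step * x1                   # dw1 = alpha*(t - a)*x1
        w2 += step * x2                   # dw2 = alpha*(t - a)*x2
        theta -= step                     # dtheta = -alpha*(t - a)
    print("after epoch", epoch, ":",
          round(w1, 2), round(w2, 2), round(theta, 2))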
