
Lecture 3: Delta Rule

Mathematical Preliminaries: Vector Notation


Vectors appear in lowercase bold font
  e.g. input vector: x = [x0, x1, x2, ..., xn]
Dot product of two vectors:
  w · x = w0 x0 + w1 x1 + ... + wn xn = Σ(i=0 to n) wi xi

E.g.: x = [1, 2, 3], y = [4, 5, 6]: x · y = (1*4) + (2*5) + (3*6) = 4 + 10 + 18 = 32
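A minimal sketch of the dot product in Python, reproducing the worked example:

```python
def dot(w, x):
    """Dot product: sum of element-wise products of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

print(dot([1, 2, 3], [4, 5, 6]))  # 32
```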

Review of the McCulloch-Pitts/Perceptron Model

[Diagram: inputs x1, x2, x3, ..., xn connected to the neuron through weights w1, ..., wn]

Neuron sums its weighted inputs:

  w0 x0 + w1 x1 + ... + wn xn = Σ(i=0 to n) wi xi = w · x = a
Neuron applies threshold activation function:

  y = f(w · x)

  where, e.g.  f(w · x) = +1  if w · x > 0
               f(w · x) = −1  if w · x ≤ 0
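The summation and threshold steps above can be sketched as follows (the weights and input are illustrative, not from the lecture):

```python
def perceptron_output(w, x):
    """McCulloch-Pitts/Perceptron unit: threshold the weighted sum a = w.x."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a > 0 else -1

# Hypothetical weights and input, for illustration only.
print(perceptron_output([0.5, -1.0], [2.0, 0.5]))  # a = 0.5 > 0, so outputs 1
```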

Review of Geometrical Interpretation


[Diagram: input space (x1, x2) divided by the line w · x = 0 into regions where y = 1 and y = −1]
Neuron defines two regions in input space where it outputs -1 and 1.


The regions are separated by a hyperplane w · x = 0 (i.e. the decision boundary).

Review of Supervised Learning

[Diagram: a Generator produces inputs x; a Supervisor provides the target output ytarget; the Learning Machine produces its own output y]

Training: Learn from training pairs (x, ytarget)


Testing: Given x, output a value y close to the supervisor's output ytarget

Learning by Error Minimization
The Perceptron Learning Rule is an algorithm for adjusting the network
weights w to minimize the difference between the actual and the
desired outputs.
We can define a Cost Function to quantify this difference:

  E(w) = (1/2p) Σp Σj (ytarj − yj)²

where the outer sum runs over the p training patterns and the inner sum over the output units j.

Intuition:
  Square makes error positive and penalises large errors more
  The 1/2 just makes the maths easier
Need to change the weights to minimize the error. How?
Use the principle of Gradient Descent
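The cost function can be sketched in Python as follows, assuming targets and outputs are stored as one list of values per training pattern (the data layout here is an assumption, not from the lecture):

```python
def cost(targets, outputs):
    """E(w) = 1/(2p) * sum over patterns p and output units j of (ytarj - yj)^2."""
    p = len(targets)  # number of training patterns
    total = sum((t - y) ** 2
                for t_pat, y_pat in zip(targets, outputs)
                for t, y in zip(t_pat, y_pat))
    return total / (2 * p)

print(cost([[1.0, 0.0]], [[0.0, 0.0]]))  # one pattern, squared error 1.0 -> E = 0.5
```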

Principle of Gradient Descent


Gradient descent is an optimization algorithm that approaches a local minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point.

[Diagram: error E plotted against w, with steps descending the slope toward the minimum]

Error Gradient
So, calculate the derivative (gradient) of the Cost Function with respect
to the weights, and then change each weight by a small increment in
the negative (opposite) direction to the gradient
To do this we need a differentiable activation function, such as the
linear function: f(a) = a

For a single output j and a single pattern:

  E(wji) = (1/2) (ytarj − yj)²

  yj = f(aj) = Σi wji xi        (linear activation, f(a) = a)

Applying the chain rule:

  ∂E/∂wji = (∂E/∂yj)(∂yj/∂wji) = −(ytarj − yj) xi

To reduce E by gradient descent, move/increment weights in the negative direction to the gradient, −(−(ytarj − yj) xi) = +(ytarj − yj) xi
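Under the linear-activation assumption above, the gradient for one pattern can be sketched as:

```python
def error_gradient(w, x, y_target):
    """dE/dwji = -(ytarj - yj) * xi for a linear unit yj = sum_i wji xi."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # linear activation: f(a) = a
    return [-(y_target - y) * xi for xi in x]

print(error_gradient([0.0, 0.0], [1.0, 2.0], 1.0))  # [-1.0, -2.0]
```

Note that the gradient points in the direction of increasing error, which is why the update rule on the next slide moves the weights the opposite way.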

Widrow-Hoff Learning Rule (Delta Rule)

  Δw = w − wold = −η ∂E/∂w = η δ x

or

  w = wold + η δ x

where δ = ytarget − y and η is a constant that controls the learning rate
(amount of increment/update Δw at each training step).
Note: the Delta Rule (DR) is similar to the Perceptron Learning Rule (PLR), with some differences:
1. Error (δ) in DR is not restricted to having values of 0, +1, or −1 (as in PLR), but may have any value
2. DR can be derived for any differentiable output/activation function f, whereas PLR only works for a threshold output function

Note that the rule will be different for a non-linear f.
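A single delta-rule update step, w = wold + η δ x, can be sketched as follows (the learning rate value is illustrative):

```python
def delta_rule_step(w, x, y_target, eta=0.5):
    """One update: w_new = w_old + eta * delta * x, with delta = ytarget - y."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # linear output unit
    delta = y_target - y
    return [wi + eta * delta * xi for wi, xi in zip(w, x)]

print(delta_rule_step([0.0, 0.0], [1.0, 1.0], 1.0))  # delta = 1 -> [0.5, 0.5]
```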

Convergence of PLR/DR
The weight changes Δwji need to be applied repeatedly, for each weight wji in the network and for each training pattern in the training set.

One pass through all the weights for the whole training set is called an epoch of training.

After many epochs, the network outputs match the targets for all the training patterns, all the Δwji are zero, and the training process ceases. We then say that the training process has converged to a solution.

It has been shown that if a possible set of weights for a Perceptron exists which solves the problem correctly, then the Perceptron Learning Rule/Delta Rule (PLR/DR) will find it in a finite number of iterations.

Furthermore, if the problem is linearly separable, then the PLR/DR will find a set of weights in a finite number of iterations that solves the problem correctly.
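The epoch structure described above can be sketched as a minimal training loop; the one-dimensional training patterns and the learning rate are illustrative, not from the lecture:

```python
def train(patterns, eta=0.1, epochs=100):
    """Apply the delta rule pattern by pattern; one pass over the set is an epoch."""
    w = [0.0] * len(patterns[0][0])
    for _ in range(epochs):
        for x, y_target in patterns:
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear unit
            delta = y_target - y
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w

# Illustrative 1-D problem: learn y = 1 * x; the weight converges towards [1.0].
w = train([([1.0], 1.0), ([2.0], 2.0)])
print(w)
```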
