
Back-Propagation Algorithm

- Perceptron
- Gradient Descent
- Multi-layered neural networks
- Back-Propagation
- More on Back-Propagation
- Examples

Inner-product
net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)

net = \sum_{i=1}^{n} w_i x_i

A measure of the projection of one vector onto another

Activation function
o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

Sign function:

f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}

Step function:

f(x) := \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

Piecewise-linear function:

f(x) := \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}

Sigmoid function:

f(x) := \sigma(x) = \frac{1}{1 + e^{-\alpha x}}
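To make the activation functions concrete, here is a minimal sketch in Python/NumPy (not part of the original slides); the function names and the slope parameter alpha are my own choices:

import numpy as np

def net_input(w, x):
    # Inner product: net = sum_i w_i * x_i
    return np.dot(w, x)

def sgn(net):
    # Sign activation: +1 if net >= 0, -1 otherwise
    return np.where(net >= 0, 1.0, -1.0)

def sigmoid(net, alpha=1.0):
    # Sigmoid activation: 1 / (1 + exp(-alpha * net))
    return 1.0 / (1.0 + np.exp(-alpha * net))

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
net = net_input(w, x)
print(net, sgn(net), sigmoid(net))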

Gradient Descent

To understand gradient descent, consider a simpler linear unit, where


o = \sum_{i=0}^{n} w_i x_i

Let us learn the weights w_i that minimize the squared error over the training data D = \{(x^1, t^1), (x^2, t^2), \ldots, (x^d, t^d), \ldots, (x^m, t^m)\} (t for target):

E[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

(Figure: the error surface E for different hypotheses, plotted over w_0 and w_1, i.e. in two dimensions.)

We want to move the weight vector in the direction that decreases E:

w_i = w_i + \Delta w_i \qquad\qquad \vec{w} = \vec{w} + \Delta\vec{w}

Differentiating E yields the update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}

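As an illustration of the update rule \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}, here is a minimal batch gradient descent sketch for the linear unit (my own code; the learning rate and the synthetic data are assumptions):

import numpy as np

def batch_gradient_descent(X, t, eta=0.01, epochs=200):
    # X: (m, n) inputs, t: (m,) targets; a bias column x_0 = 1 can be appended to X
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # linear unit outputs for all training examples
        w += eta * X.T @ (t - o)       # Delta w_i = eta * sum_d (t_d - o_d) * x_id
    return w

# Toy usage: recover the weights of a noiseless linear target
# (eta must be small enough for stability on this data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5])
print(batch_gradient_descent(X, t))    # should be approximately [1.0, -2.0, 0.5]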

Stochastic Approximation to Gradient Descent

\Delta w_i = \eta\, (t - o)\, x_i


- The gradient descent training rule updates the weights by summing over all training examples in D
- Stochastic gradient approximates gradient descent by updating the weights incrementally, calculating the error for each example
- This is known as the delta rule or LMS (least mean-square) weight update

This is the Adaline rule, used for adaptive filters; Widrow and Hoff (1960).
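For comparison, a sketch (again my own) of the incremental LMS/Adaline-style update \Delta w_i = \eta (t - o) x_i, applied one example at a time; it can be called on the same toy data as the batch version above:

import numpy as np

def stochastic_delta_rule(X, t, eta=0.01, epochs=100):
    # Incremental (stochastic) updates: adjust the weights after each example
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o = w @ x_d                  # output of the linear unit for this example
            w += eta * (t_d - o) * x_d   # Delta w_i = eta * (t - o) * x_i
    return w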

XOR problem and Perceptron

By Minsky and Papert in the mid-1960s

Multi-layer Networks

- The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units
- A network with just one hidden layer can represent any Boolean function
- The great power of multi-layer networks was realized long ago

- But it was only in the eighties that it was shown how to make them learn

- Multiple layers of cascaded linear units still produce only linear functions; we search for networks capable of representing nonlinear functions

- Units should use nonlinear activation functions; examples of nonlinear activation functions were given above

XOR-example

Back-propagation is a learning algorithm for multi-layer neural networks. It was invented independently several times:

- Bryson and Ho [1969]
- Werbos [1974]
- Parker [1985]
- Rumelhart et al. [1986]

Parallel Distributed Processing, Vol. 1: Foundations
David E. Rumelhart, James L. McClelland and the PDP Research Group
"What makes people smarter than computers?" These volumes by a pioneering neurocomputing...


Back-propagation

The algorithm gives a prescription for changing the weights w_{ij} in any feed-forward network so as to learn a training set of input-output pairs \{x^d, t^d\}. We consider a simple two-layer network:

(Figure: network diagram with input units x_1, x_2, x_3, x_4, x_5 feeding a hidden layer, which feeds the output units.)


Given the pattern x^d, hidden unit j receives a net input

net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

Output unit i thus receives

net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} \left(W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)

and produces the final output

o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} \left(W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)\right)

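As a concrete illustration of these equations (my own sketch, not from the slides), the forward pass of the 5-input, 3-hidden-unit, 2-output network, assuming sigmoid units and small random weights:

import numpy as np

def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * net))

def forward(x, w, W):
    # w: (3, 5) input-to-hidden weights, W: (2, 3) hidden-to-output weights
    net_hidden = w @ x            # net_j = sum_k w_jk * x_k
    V = sigmoid(net_hidden)       # hidden outputs V_j = f(net_j)
    net_out = W @ V               # net_i = sum_j W_ij * V_j
    o = sigmoid(net_out)          # final outputs o_i = f(net_i)
    return V, o

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(3, 5))
W = rng.normal(scale=0.1, size=(2, 3))
x = np.array([1.0, 0.0, -1.0, 0.5, 2.0])
V, o = forward(x, w, W)
print(V, o)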

Our usual error function:

For l outputs and m input-output pairs \{x^d, t^d\},

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example E becomes

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - f\left(\sum_{j=1}^{3} W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)\right)^2

E[\vec{w}] is differentiable provided f is differentiable, so gradient descent can be applied.
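Continuing the previous sketch (it reuses the forward function and the weights w, W defined there), the error E[w] over a toy training set of m patterns:

import numpy as np

def squared_error(X, T, w, W):
    # E[w] = 1/2 * sum_d sum_i (t_i^d - o_i^d)^2
    E = 0.0
    for x_d, t_d in zip(X, T):
        _, o_d = forward(x_d, w, W)   # forward() as defined in the earlier sketch
        E += 0.5 * np.sum((t_d - o_d) ** 2)
    return E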


For the hidden-to-output connections the gradient descent rule gives:

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} (t_i^d - o_i^d)\, f'(net_i^d) \cdot (-V_j^d)

\Delta W_{ij} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d)\, f'(net_i^d)\, V_j^d

With \delta_i^d = f'(net_i^d)\,(t_i^d - o_i^d), this becomes

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d\, V_j^d
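A sketch (mine, for the same assumed 5-3-2 sigmoid network) of the hidden-to-output update \Delta W_{ij} = \eta \sum_d \delta_i^d V_j^d:

import numpy as np

def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * net))

def sigmoid_prime(net, alpha=1.0):
    s = sigmoid(net, alpha)
    return alpha * s * (1.0 - s)

def hidden_to_output_update(X, T, w, W, eta=0.1):
    # Batch update of the hidden-to-output weights W_ij
    dW = np.zeros_like(W)
    for x_d, t_d in zip(X, T):
        V = sigmoid(w @ x_d)                                        # hidden outputs V_j^d
        net_o = W @ V
        delta_out = sigmoid_prime(net_o) * (t_d - sigmoid(net_o))   # delta_i^d
        dW += eta * np.outer(delta_out, V)                          # eta * delta_i^d * V_j^d
    return dW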

For the input-to-hidden connections w_{jk} we must differentiate with respect to w_{jk}.

Using the chain rule we obtain

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)\, f'(net_i^d)\, W_{ij}\, f'(net_j^d)\, x_k^d

\delta_i^d = f'(net_i^d)\,(t_i^d - o_i^d)

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d\, W_{ij}\, f'(net_j^d)\, x_k^d

\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij}\, \delta_i^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d\, x_k^d

Side by side, the two update rules are

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d\, V_j^d \qquad\qquad \Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d\, x_k^d

We have the same form, with a different definition of \delta.
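A small sketch (mine) of how the hidden-unit deltas are obtained from the output deltas, \delta_j^d = f'(net_j^d) \sum_i W_{ij} \delta_i^d; sigmoid_prime is the derivative function from the previous sketch:

def hidden_deltas(W, delta_out, net_hidden):
    # delta_j = f'(net_j) * sum_i W_ij * delta_i
    # W has shape (n_outputs, n_hidden); W.T @ delta_out propagates the
    # output errors backward through the weights
    return sigmoid_prime(net_hidden) * (W.T @ delta_out)

The output errors flow backward through the transposed weight matrix, which is exactly the "back-propagation" that gives the algorithm its name.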


In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}

where "output" and "input" refer to the two ends of the connection concerned; V stands for the appropriate input-end activation (hidden unit or real input x^d), and \delta depends on the layer concerned.

The equation

\delta_j^d = f'(net_j^d) \sum_{i} W_{ij}\, \delta_i^d

allows us to determine the \delta of a given hidden unit V_j in terms of the \delta's of the output units o_i it feeds. The coefficients are the usual forward weights, but the errors are propagated backward:

back-propagation


We have to use a nonlinear differentiable activation function

Examples:

f(x) = \sigma(x) = \frac{1}{1 + e^{-\alpha x}}

f'(x) = \sigma'(x) = \alpha \cdot \sigma(x) \cdot (1 - \sigma(x))

f(x) = \tanh(\beta x)

f'(x) = \beta \cdot (1 - f(x)^2)
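A quick numerical check (my own) that these derivative formulas behave as expected, comparing them against central finite differences; the values of alpha and beta are arbitrary:

import numpy as np

alpha, beta, h, x = 1.5, 0.7, 1e-6, 0.3
sigma = lambda z: 1.0 / (1.0 + np.exp(-alpha * z))

print(alpha * sigma(x) * (1 - sigma(x)),                       # analytic sigmoid derivative
      (sigma(x + h) - sigma(x - h)) / (2 * h))                 # finite-difference estimate
print(beta * (1 - np.tanh(beta * x) ** 2),                     # analytic tanh derivative
      (np.tanh(beta * (x + h)) - np.tanh(beta * (x - h))) / (2 * h))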


Consider a network with M layers, m = 1, 2, ..., M
- V_i^m denotes the output of the ith unit in the mth layer
- V_i^0 is a synonym for x_i, the ith input
- The superscript m labels layers, not patterns
- w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation Algorithm (mostly used)


1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network:
   V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M)\,(t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, ..., 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta\, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}
7. Go to step 2 and repeat for the next pattern

(A code sketch of this loop is given below.)
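A minimal end-to-end sketch of this loop (my own code, not the slides'), for a single hidden layer of sigmoid units; the layer sizes, the learning rate, the bias handling and the XOR data are assumptions made for illustration, and convergence depends on the random initialization:

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_backprop(X, T, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the weights to small random values
    w = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))   # input-to-hidden
    W = rng.normal(scale=0.5, size=(T.shape[1], n_hidden))   # hidden-to-output
    for _ in range(epochs):
        for x_d, t_d in zip(X, T):                           # 2. choose a pattern
            # 3. Propagate the signal forward through the network
            V = sigmoid(w @ x_d)
            o = sigmoid(W @ V)
            # 4. Deltas for the output layer; for the sigmoid, f'(net) = o * (1 - o)
            delta_out = o * (1.0 - o) * (t_d - o)
            # 5. Deltas for the preceding (hidden) layer
            delta_hid = V * (1.0 - V) * (W.T @ delta_out)
            # 6. Update all connections
            W += eta * np.outer(delta_out, V)
            w += eta * np.outer(delta_hid, x_d)
    return w, W

# Toy usage: XOR, with a constant bias input of 1 appended to each pattern
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
w, W = train_backprop(X, T)
print(sigmoid(W @ sigmoid(w @ X.T)).T.round(2))   # ideally close to [0, 1, 1, 0]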


More on Back-Propagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum

In practice, often works well (can run multiple times)

Gradient descent can be very slow if \eta is too small, and can oscillate widely if \eta is too large. One often includes a weight momentum term:

\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \cdot \Delta w_{pq}(t)

The momentum parameter \alpha is chosen between 0 and 1; 0.9 is a good value.
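A sketch (mine) of a single weight update with momentum; it assumes some function grad_E(w) returning dE/dw is available, which is not defined in the slides:

def momentum_step(w, delta_w_prev, grad_E, eta=0.1, alpha=0.9):
    # Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)
    delta_w = -eta * grad_E(w) + alpha * delta_w_prev
    return w + delta_w, delta_w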


Minimizes error over training examples

Will it generalize well to unseen examples?

Training can take thousands of iterations, so it is slow! Using the network after training is very fast.


Convergence of Backpropagation

Gradient descent to some local minimum


Perhaps not the global minimum... Possible remedies:
- Add momentum
- Use stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence

- Initialize the weights near zero
- Therefore, the initial network is near-linear
- Increasingly non-linear functions become possible as training progresses


Expressive Capabilities of ANNs

Boolean functions:
- Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

NETtalk, Sejnowski et al., 1987


Prediction


- Perceptron
- Gradient Descent
- Multi-layered neural networks
- Back-Propagation
- More on Back-Propagation
- Examples


RBF Networks, Support Vector Machines

