Back-Propagation Algorithm
Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples
Inner-product
net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \, \|\vec{x}\| \cos(\theta)

net = \sum_{i=1}^{n} w_i x_i
Activation function
o = f(net) = f\Big( \sum_{i=1}^{n} w_i x_i \Big)
f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

f(x) := \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } 0.5 > x > -0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}
sigmoid function
f(x) = \sigma(x) = \frac{1}{1 + e^{-ax}}
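As a minimal illustration (not part of the original slides), the sketch below implements these three activation functions in Python with NumPy; the function names and the default slope a are illustrative choices.

import numpy as np

def step(x):
    # Threshold unit: 1 if x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def ramp(x):
    # Piecewise-linear unit: 1 above 0.5, 0 below -0.5, x in between
    return np.where(x >= 0.5, 1.0, np.where(x <= -0.5, 0.0, x))

def sigmoid(x, a=1.0):
    # Logistic sigmoid with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * x))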
Gradient Descent
o = \sum_{i=0}^{n} w_i x_i

w_i \leftarrow w_i + \Delta w_i, \qquad \vec{w} \leftarrow \vec{w} + \Delta \vec{w}
Differentiating E
"w i = # % (t d & od )x id
d $D
!
The gradient descent training rule updates the weights by summing over all training examples in D. Stochastic gradient descent approximates gradient descent by updating the weights incrementally, calculating the error for each example. This is known as the delta rule or LMS (least mean-square) weight update.
It is also the Adaline rule, used for adaptive filters by Widrow and Hoff (1960).
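A minimal sketch of this incremental (stochastic) delta rule for a single linear unit; the array shapes, the learning rate eta, and the fixed number of epochs are illustrative assumptions.

import numpy as np

def delta_rule(X, t, eta=0.05, epochs=100):
    # X: (n_samples, n_features) inputs, t: (n_samples,) targets
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):          # update incrementally, one example at a time
            o_d = np.dot(w, x_d)            # linear output o = sum_i w_i x_i
            w += eta * (t_d - o_d) * x_d    # delta / LMS update
    return w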
Multi-layer Networks
The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units. A network with just one hidden layer can represent any Boolean function.
The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn.
Multiple layers of cascaded linear units still produce only linear functions. We seek networks capable of representing nonlinear functions, so the units must use nonlinear activation functions; examples of nonlinear activation functions were given above.
XOR-example
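As a concrete sketch of how one hidden layer solves XOR, here is an illustrative construction with hand-picked weights, assuming threshold units (not necessarily the weights in the original figure):

import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # hidden unit ~ OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)     # hidden unit ~ AND(x1, x2)
    return step(h1 - h2 - 0.5)   # output fires when OR holds but AND does not

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0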
Back-propagation is a learning algorithm for multi-layer neural networks. It was invented independently several times.
Back-propagation
The algorithm gives a prescription for changing the weights w_ij in any feed-forward network so as to learn a training set of input-output pairs {x_d, t_d}. We consider a simple two-layer network.
[Figure: two-layer feed-forward network with inputs x_1, ..., x_5 (generic input x_k)]
\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d
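For context, a brief derivation sketch of this rule, assuming the quadratic error E = ½ Σ_d Σ_i (t_i^d − V_i^d)²:

E = \frac{1}{2} \sum_{d} \sum_{i} \left( t_i^d - V_i^d \right)^2, \qquad V_i^d = f(net_i^d) = f\Big( \sum_j W_{ij} V_j^d \Big)

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}}
  = \eta \sum_{d} \left( t_i^d - V_i^d \right) f'(net_i^d)\, V_j^d
  = \eta \sum_{d} \delta_i^d V_j^d,
\qquad \delta_i^d = f'(net_i^d) \left( t_i^d - V_i^d \right)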
For the input-to-hidden connections w_jk we must differentiate the error with respect to w_jk.
"W ij = #%$id V jd
d =1
d =1
In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}

where "output" and "input" refer to the two ends of the connection concerned, V stands for the appropriate input-end activation (a hidden unit or a real input x_d), and δ depends on the layer concerned.
This allows us to determine the δ for a given hidden unit V_j in terms of the δs of the output units o_i it feeds. The coefficients are the usual forward weights, but the errors are propagated backward; hence the name back-propagation.
Examples:
f(x) = \sigma(x) = \frac{1}{1 + e^{-ax}}

f'(x) = \sigma'(x) = a\, \sigma(x)\, (1 - \sigma(x))
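A quick check of this derivative identity (standard calculus, sketched here for completeness):

\sigma'(x) = \frac{d}{dx} \left( 1 + e^{-ax} \right)^{-1}
  = \frac{a\, e^{-ax}}{\left( 1 + e^{-ax} \right)^2}
  = a \cdot \frac{1}{1 + e^{-ax}} \cdot \frac{e^{-ax}}{1 + e^{-ax}}
  = a\, \sigma(x) \left( 1 - \sigma(x) \right)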
Consider a network with M layers, m = 1, 2, ..., M. Let V_i^m denote the output of the ith unit in the mth layer, with V_i^0 a synonym for the ith input x_i. The subscript m refers to layers, not patterns. w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m.

1. Initialize the weights to small random values.
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k.
3. Propagate the signal forward through the network:
   V_i^m = f(net_i^m) = f\Big( \sum_j w_{ij}^m V_j^{m-1} \Big)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M) (t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, ..., 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta\, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}^m
7. Go to 2 and repeat for the next pattern.
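A minimal Python/NumPy sketch of steps 1-7 for the simple two-layer case, using the sigmoid activation from above; the function names, array shapes, and the fixed number of training passes are illustrative assumptions.

import numpy as np

def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_prime(x, a=1.0):
    s = sigmoid(x, a)
    return a * s * (1.0 - s)

def train_backprop(X, T, n_hidden=3, eta=0.5, epochs=1000, seed=0):
    # X: (n_patterns, n_inputs), T: (n_patterns, n_outputs)
    rng = np.random.default_rng(seed)
    # 1. Initialize the weights to small random values
    w = rng.normal(0.0, 0.1, size=(n_hidden, X.shape[1]))   # input -> hidden (w_jk)
    W = rng.normal(0.0, 0.1, size=(T.shape[1], n_hidden))   # hidden -> output (W_ij)
    for _ in range(epochs):
        for x_d, t_d in zip(X, T):
            # 2.-3. Apply the pattern and propagate the signal forward
            net_h = w @ x_d
            V = sigmoid(net_h)               # hidden-layer outputs V_j
            net_o = W @ V
            o = sigmoid(net_o)               # output-layer outputs
            # 4. Deltas for the output layer
            delta_o = sigmoid_prime(net_o) * (t_d - o)
            # 5. Deltas for the hidden layer: errors propagated backward
            delta_h = sigmoid_prime(net_h) * (W.T @ delta_o)
            # 6. Update all connections
            W += eta * np.outer(delta_o, V)
            w += eta * np.outer(delta_h, x_d)
            # 7. Continue with the next pattern
    return w, W

Bias weights are omitted here to keep the indices close to the slides; in practice one usually adds a constant input x_0 = 1 (and a constant hidden unit) so each layer has an adjustable threshold.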
More on Back-Propagation
Back-propagation performs gradient descent over the entire network weight vector. It is easily generalized to arbitrary directed graphs. It will find a local, not necessarily global, error minimum.
Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large. A weight momentum term α is therefore often included:
"w pq (t + 1) = #$
Training can take thousands of iterations, so it is slow. Using the network after training, however, is very fast.
Convergence of Backpropagation
Back-propagation may converge to a local minimum, perhaps not the global minimum. Remedies: add momentum, use stochastic gradient descent, or train multiple nets with different initial weights (see the sketch below).
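The multiple-restarts remedy could be sketched as follows, reusing the illustrative train_backprop function from the algorithm sketch above and keeping the net with the lowest training error; the error measure and the seeds are assumptions.

import numpy as np

def forward(w, W, x, a=1.0):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-a * z))
    return sigmoid(W @ sigmoid(w @ x))

def best_of_restarts(X, T, n_restarts=5):
    best, best_err = None, np.inf
    for seed in range(n_restarts):                   # different initial weights per restart
        w, W = train_backprop(X, T, seed=seed)       # illustrative trainer from above
        err = sum(np.sum((t - forward(w, W, x)) ** 2) for x, t in zip(X, T))
        if err < best_err:
            best, best_err = (w, W), err
    return best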
Nature of convergence
Weights are initialized near zero, so the initial network computes a nearly linear function; increasingly non-linear functions become possible as training progresses.
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs.

Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Prediction
Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples