
Multi Layer Perceptron

This document discusses the multi-layer perceptron model and the backpropagation algorithm. It covers the MLP model structure with input, hidden and output layers. It then describes the backpropagation algorithm in detail, including calculating the error signal, defining the cost function, deriving the gradient descent learning rule, and distinguishing the calculations for output versus hidden neurons. The goal is to optimize network weights using gradient descent and backpropagation of error signals from the output to hidden layers.

WK3 – Multi Layer Perceptron

CS 476: Networks of Neural Computation, CSD, UOC, 2009


Feature Detection

(Figure slide; the graphic is not preserved in this extraction.)


Contents

•MLP model details
•Back-propagation algorithm
•XOR Example
•Heuristics for Back-propagation
•Heuristics for learning rate
•Approximation of functions
•Generalisation
•Model selection through cross-validation
•Conjugate-Gradient method for BP


Contents II

•Advantages and disadvantages of BP
•Types of problems for applying BP
•Conclusions


Multi Layer Perceptron

•“Neurons” are positioned in layers. There are Input, Hidden and Output Layers.


Multi Layer Perceptron Output

•The output y_j of neuron j is calculated by:

y_j(n) = φ_j(v_j(n)) = φ_j( Σ_{i=0}^{m} w_ji(n) y_i(n) )

where w_j0(n) is the bias (the corresponding input is y_0(n) = +1).

•The function φ_j(•) is a sigmoid function. Typical examples are:


Transfer Functions

•The logistic sigmoid:

y = 1 / (1 + exp(-x))


Transfer Functions II

•The hyperbolic tangent sigmoid:

y = tanh(x) = sinh(x) / cosh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
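
As a concrete illustration of the output equation and the two transfer functions above, here is a minimal NumPy sketch; the weight values and layer size in the example are arbitrary assumptions, not taken from the slides:

    import numpy as np

    def logistic(x):
        # y = 1 / (1 + exp(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def tanh_sigmoid(x):
        # y = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
        return np.tanh(x)

    def neuron_output(w, y_prev, phi=logistic):
        # v_j = sum_{i=0..m} w_ji * y_i, with y_0 = +1 so that w_j0 acts as the bias
        y = np.concatenate(([1.0], y_prev))
        v = np.dot(w, y)
        return phi(v)

    # example neuron: bias 0.5 and two incoming weights (made-up numbers)
    w = np.array([0.5, -1.0, 2.0])
    print(neuron_output(w, np.array([0.3, 0.7])),
          neuron_output(w, np.array([0.3, 0.7]), phi=tanh_sigmoid))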


Learning Algorithm

•Assume that a set of examples T = {x(n), d(n)}, n = 1, …, N is given, where x(n) is the input vector of dimension m_0 and d(n) is the desired response vector of dimension M.
•Thus an error signal, e_j(n) = d_j(n) - y_j(n), can be defined for the output neuron j.
•We can derive a learning algorithm for an MLP by assuming an optimisation approach which is based on the steepest descent direction, i.e.

Δw(n) = -η g(n)

where g(n) is the gradient vector of the cost function and η is the learning rate.


Learning Algorithm II

•The algorithm that is derived from the steepest descent direction is called back-propagation.
•Assume that we define a SSE instantaneous cost function (i.e. per example) as follows:

E(n) = (1/2) Σ_{j∈C} e_j²(n)

where C is the set of all output neurons.
•If we assume that there are N examples in the set T, then the average squared error is:

E_av = (1/N) Σ_{n=1}^{N} E(n)


Learning Algorithm III

•We need to calculate the gradient wrt E_av or wrt E(n). In the first case we calculate the gradient per epoch (i.e. over all N patterns), while in the second the gradient is calculated per pattern.
•In the case of E_av we have the Batch mode of the algorithm. In the case of E(n) we have the Online or Stochastic mode of the algorithm.
•Assume that we use the online mode for the rest of the calculation. The gradient is defined as:

g(n) = ∂E(n) / ∂w_ji(n)


Learning Algorithm IV

•Using the chain rule of calculus we can write:

∂E(n)/∂w_ji(n) = [∂E(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂w_ji(n)]

•We calculate the different partial derivatives as follows:

∂E(n)/∂e_j(n) = e_j(n)

∂e_j(n)/∂y_j(n) = -1


Learning Algorithm V

•And,

∂y_j(n)/∂v_j(n) = φ_j'(v_j(n))

∂v_j(n)/∂w_ji(n) = y_i(n)

•Combining all the previous equations we finally get:

Δw_ji(n) = -η ∂E(n)/∂w_ji(n) = η e_j(n) φ_j'(v_j(n)) y_i(n)


Learning Algorithm VI

•The equation regarding the weight corrections can be written as:

Δw_ji(n) = η δ_j(n) y_i(n)

where δ_j(n) is defined as the local gradient and is given by:

δ_j(n) = -∂E(n)/∂v_j(n) = -[∂E(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = e_j(n) φ_j'(v_j(n))

•We need to distinguish two cases:
• j is an output neuron
• j is a hidden neuron


Learning Algorithm VII

•Thus the Back-Propagation algorithm is an error-correction algorithm for supervised learning.

•If j is an output neuron, we already have a definition of e_j(n), so δ_j(n) is defined (after substitution) as:

δ_j(n) = (d_j(n) - y_j(n)) φ_j'(v_j(n))

•If j is a hidden neuron then δ_j(n) is defined as:

δ_j(n) = -[∂E(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = -[∂E(n)/∂y_j(n)] φ_j'(v_j(n))


Learning Algorithm VIII

•To calculate the partial derivative of E(n) wrt y_j(n), we recall the definition of E(n) and change the index for the output neurons to k, i.e.

E(n) = (1/2) Σ_{k∈C} e_k²(n)

•Then we have:

∂E(n)/∂y_j(n) = Σ_{k∈C} e_k(n) ∂e_k(n)/∂y_j(n)


Learning Algorithm IX

•We use the chain rule of differentiation again to get the partial derivative of e_k(n) wrt y_j(n):

∂E(n)/∂y_j(n) = Σ_{k∈C} e_k(n) [∂e_k(n)/∂v_k(n)] [∂v_k(n)/∂y_j(n)]

•Remembering the definition of e_k(n) we have:

e_k(n) = d_k(n) - y_k(n) = d_k(n) - φ_k(v_k(n))

•Hence:

∂e_k(n)/∂v_k(n) = -φ_k'(v_k(n))


Learning Algorithm X

•The local field v_k(n) is defined as:

v_k(n) = Σ_{j=0}^{m} w_kj(n) y_j(n)

where m is the number of neurons (from the previous layer) which connect to neuron k. Thus we get:

∂v_k(n)/∂y_j(n) = w_kj(n)

•Hence:

∂E(n)/∂y_j(n) = -Σ_{k∈C} e_k(n) φ_k'(v_k(n)) w_kj(n) = -Σ_{k∈C} δ_k(n) w_kj(n)


Learning Algorithm XI

•Putting it all together, we find for the local gradient of a hidden neuron j the following formula:

δ_j(n) = φ_j'(v_j(n)) Σ_{k∈C} δ_k(n) w_kj(n)

•It is useful to remember the special form of the derivatives of the logistic and hyperbolic tangent sigmoids:
• φ_j'(v_j(n)) = y_j(n) [1 - y_j(n)]  (Logistic)
• φ_j'(v_j(n)) = [1 - y_j(n)] [1 + y_j(n)]  (Hyp. Tangent)
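
A small sketch of the two local-gradient cases for the logistic sigmoid, using the derivative shortcut φ'(v_j) = y_j(1 - y_j) from the slide above; the array shapes and values are assumptions for illustration:

    import numpy as np

    def delta_output(d, y):
        # delta_j = (d_j - y_j) * y_j * (1 - y_j)   (output neurons, logistic units)
        return (d - y) * y * (1.0 - y)

    def delta_hidden(y_hidden, delta_next, w_next):
        # delta_j = y_j * (1 - y_j) * sum_k delta_k * w_kj   (hidden neurons)
        # w_next[k, j] is the weight from hidden neuron j to next-layer neuron k
        return y_hidden * (1.0 - y_hidden) * (w_next.T @ delta_next)

    d_out = delta_output(np.array([1.0]), np.array([0.7]))
    print(delta_hidden(np.array([0.4, 0.9]), d_out, np.array([[0.2, -0.5]])))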


Summary of BP Algorithm

1. Initialisation: Assuming that no prior information is available, pick the synaptic weights and thresholds from a uniform distribution whose mean is zero and whose variance is chosen to make the std of the local fields of the neurons lie at the transition between the linear and saturated parts of the sigmoid function.
2. Presentation of training examples: Present the network with an epoch of training examples. For each example in the set, perform the sequence of forward and backward computations described in points 3 & 4 below.


Summary of BP Algorithm II

3. Forward Computation:
• Let the training example in the epoch be denoted by (x(n), d(n)), where x is the input vector and d is the desired vector.
• Compute the local fields by proceeding forward through the network layer by layer. The local field for neuron j at layer l is defined as:

v_j^(l)(n) = Σ_{i=0}^{m} w_ji^(l)(n) y_i^(l-1)(n)

where m is the number of neurons which connect to j, y_i^(l-1)(n) is the activation of neuron i at layer (l-1), and w_ji^(l)(n) is the weight


Summary of BP Algorithm III

which connects the neurons j and i.
• For i = 0, we have y_0^(l-1)(n) = +1 and w_j0^(l)(n) = b_j^(l)(n), the bias of neuron j.
• Assuming a sigmoid function, the output signal of neuron j is:

y_j^(l)(n) = φ_j(v_j^(l)(n))

• If j is in the input layer we simply set:

y_j^(0)(n) = x_j(n)

where x_j(n) is the jth component of the input vector x.


Summary of BP Algorithm IV

• If j is in the output layer we have:

y_j^(L)(n) = o_j(n)

where o_j(n) is the jth component of the output vector o and L is the total number of layers in the network.
• Compute the error signal:

e_j(n) = d_j(n) - o_j(n)

where d_j(n) is the desired response for the jth element.


Summary of BP Algorithm V

4. Backward Computation:
• Compute the δs of the network, defined by:

δ_j^(L)(n) = e_j(n) φ_j'(v_j^(L)(n))                           for neuron j in output layer L
δ_j^(l)(n) = φ_j'(v_j^(l)(n)) Σ_k δ_k^(l+1)(n) w_kj^(l+1)(n)   for neuron j in hidden layer l

where φ_j'(•) is the derivative of the function φ_j wrt its argument.
• Adjust the weights using the generalised delta rule:

w_ji^(l)(n+1) = w_ji^(l)(n) + α Δw_ji^(l)(n-1) + η δ_j^(l)(n) y_i^(l-1)(n)

where α is the momentum constant, η is the learning rate and Δw_ji^(l)(n-1) is the weight correction of the previous iteration.


Summary of BP Algorithm VI

5. Iteration: Iterate the forward and backward computations of steps 3 & 4 by presenting new epochs of training examples until the stopping criterion is met.

• The order of presentation of examples should be randomised from epoch to epoch.
• The momentum and learning rate parameters are typically changed (usually decreased) as the number of training iterations increases.
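
The five steps above can be condensed into a short online-mode sketch for a single-hidden-layer MLP with logistic units and momentum. The layer sizes, learning rate, momentum constant and toy data are illustrative assumptions only, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Step 1: initialisation (small zero-mean weights; first column holds the bias)
    m0, m1, M = 2, 4, 1                      # input, hidden, output sizes (assumed)
    W1 = rng.uniform(-0.5, 0.5, (m1, m0 + 1))
    W2 = rng.uniform(-0.5, 0.5, (M, m1 + 1))
    dW1_prev = np.zeros_like(W1)
    dW2_prev = np.zeros_like(W2)
    eta, alpha = 0.5, 0.9                    # learning rate and momentum (assumed)

    X = rng.uniform(0, 1, (20, m0))          # toy data (assumed)
    D = (X.sum(axis=1, keepdims=True) > 1).astype(float)

    for epoch in range(100):                 # Step 5: iterate over epochs
        for x, d in zip(X, D):               # Step 2: present each example
            # Step 3: forward computation (y_0 = +1 feeds the bias weight)
            y0 = np.concatenate(([1.0], x))
            y1 = logistic(W1 @ y0)
            y1b = np.concatenate(([1.0], y1))
            o = logistic(W2 @ y1b)
            e = d - o
            # Step 4: backward computation (deltas, then generalised delta rule)
            delta2 = e * o * (1 - o)
            delta1 = y1 * (1 - y1) * (W2[:, 1:].T @ delta2)
            dW2 = alpha * dW2_prev + eta * np.outer(delta2, y1b)
            dW1 = alpha * dW1_prev + eta * np.outer(delta1, y0)
            W2 += dW2
            W1 += dW1
            dW2_prev, dW1_prev = dW2, dW1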


Stopping Criteria

• The BP algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
• The BP algorithm is considered to have converged when the absolute value of the change in the average squared error per epoch is sufficiently small.


XOR Example

• The XOR problem is defined by the following truth table:

x1  x2 | x1 XOR x2
 0   0 |     0
 0   1 |     1
 1   0 |     1
 1   1 |     0

• The following network solves the problem; a single-layer perceptron could not do this. (We use the Sgn function.)

(The network diagram from the original slide is not preserved in this extraction.)
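
Since the network diagram is missing from this extraction, here is one possible 2-2-1 threshold network that solves XOR. The specific weights and thresholds are a standard textbook choice, and a hard-limiting step unit stands in for the slide's Sgn function; they are assumptions, not necessarily the values on the original slide:

    import numpy as np

    def step(v):
        # hard-limiting unit (used here in place of the slide's Sgn function)
        return (v >= 0).astype(float)

    def xor_net(x1, x2):
        x = np.array([1.0, x1, x2])                          # leading 1 feeds the bias weights
        h1 = step(np.dot([-0.5, 1.0, 1.0], x))               # OR-like hidden unit
        h2 = step(np.dot([-1.5, 1.0, 1.0], x))               # AND-like hidden unit
        y = step(np.dot([-0.5, 1.0, -1.0], [1.0, h1, h2]))   # output: h1 AND NOT h2
        return int(y)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))                       # reproduces the XOR truth table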


Heuristics for Back-Propagation

• To speed up the convergence of the back-propagation algorithm the following heuristics are applied:
• H1: Use sequential (online) rather than batch updates
• H2: Maximise information content
  • Use examples that produce the largest error
  • Use examples which are very different from all the previous ones
• H3: Use an antisymmetric activation function, such as the hyperbolic tangent. Antisymmetric means:

φ(-x) = -φ(x)


Heuristics for Back-Propagation II

• H4: Use target values inside a smaller range, offset from the asymptotic values of the sigmoid

• H5: Normalise the inputs:
  • Create zero-mean variables
  • Decorrelate the variables
  • Scale the variables to have approximately equal covariances

• H6: Initialise the weights properly. Use a zero-mean distribution whose standard deviation is

σ_w = m^(-1/2)


Heuristics for Back-Propagation III

where m is the number of connections arriving at a neuron.
• H7: Learn from hints
• H8: Adapt the learning rates appropriately (see next section)
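
A sketch of heuristics H5 and H6: zero-mean, decorrelated, roughly equal-scale inputs, and zero-mean weights with standard deviation m^(-1/2). The whitening recipe and array shapes are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def normalise_inputs(X):
        # H5: zero-mean, decorrelated, roughly equal-variance inputs (PCA whitening sketch)
        Xc = X - X.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        eigval, eigvec = np.linalg.eigh(cov)
        return Xc @ eigvec / np.sqrt(eigval + 1e-12)

    def init_weights(n_out, n_in):
        # H6: zero-mean weights with std sigma_w = n_in ** -0.5 (bias column included)
        sigma = n_in ** -0.5
        return rng.normal(0.0, sigma, size=(n_out, n_in + 1))

    X = rng.uniform(0, 10, size=(100, 3))    # toy inputs (assumed)
    Xn = normalise_inputs(X)
    W1 = init_weights(5, Xn.shape[1])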


Heuristics for Learning Rate

• R1: Every adjustable parameter should have its own learning rate
• R2: Every learning rate should be allowed to adjust from one iteration to the next
• R3: When the derivative of the cost function wrt a weight has the same algebraic sign for several consecutive iterations of the algorithm, the learning rate for that particular weight should be increased.
• R4: When the algebraic sign of the derivative above alternates for several consecutive iterations of the algorithm, the learning rate should be decreased.
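
A minimal sketch of R1-R4 in the spirit of a delta-bar-delta style rule: each weight keeps its own learning rate, which grows while the gradient sign stays the same and shrinks when it alternates. The increase and decrease factors are arbitrary assumptions:

    import numpy as np

    def adapt_learning_rates(etas, grad, grad_prev, up=1.05, down=0.7):
        # R3: same sign of dE/dw as before -> increase that weight's learning rate
        # R4: sign alternates -> decrease it
        same_sign = np.sign(grad) == np.sign(grad_prev)
        return np.where(same_sign, etas * up, etas * down)

    # usage inside a training loop (grad = dE/dw for the current step):
    #   etas = adapt_learning_rates(etas, grad, grad_prev)
    #   w -= etas * grad
    #   grad_prev = grad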


Approximation of Functions

•Q: What is the minimum number of hidden layers in an MLP that provides an approximate realisation of any continuous mapping?

•A: Universal Approximation Theorem
Let φ(•) be a nonconstant, bounded, and monotone increasing continuous function. Let I_m0 denote the m0-dimensional unit hypercube [0,1]^m0. The space of continuous functions on I_m0 is denoted by C(I_m0). Then, given any function f ∈ C(I_m0) and ε > 0, there exists an integer m1 and sets of real constants a_i, b_i and w_ij, where i = 1, …, m1 and j = 1, …, m0, such that we may


Approximation of Functions II

define:

F(x_1, …, x_m0) = Σ_{i=1}^{m1} a_i φ( Σ_{j=1}^{m0} w_ij x_j + b_i )

as an approximate realisation of the function f(•); that is:

|F(x_1, …, x_m0) - f(x_1, …, x_m0)| < ε

for all x_1, …, x_m0 that lie in the input space.
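
The approximating function F of the theorem is exactly a single-hidden-layer MLP with a linear output layer; a sketch, with the constants a_i, b_i, w_ij drawn at random only so the example runs:

    import numpy as np

    rng = np.random.default_rng(0)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def F(x, a, W, b, phi=logistic):
        # F(x) = sum_i a_i * phi( sum_j w_ij * x_j + b_i )
        return a @ phi(W @ x + b)

    m0, m1 = 3, 10                    # input dimension and number of hidden units (assumed)
    a, b = rng.normal(size=m1), rng.normal(size=m1)
    W = rng.normal(size=(m1, m0))
    print(F(rng.uniform(0, 1, m0), a, W, b))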


Approximation of Functions III

•The Universal Approximation Theorem is directly applicable to MLPs. Specifically:
• The sigmoid functions cover the requirements for the function φ
• The network has m0 input nodes and a single hidden layer consisting of m1 neurons; the inputs are denoted by x_1, …, x_m0
• Hidden neuron i has synaptic weights w_i1, …, w_im0 and bias b_i
• The network output is a linear combination of the outputs of the hidden neurons, with a_1, …, a_m1 defining the synaptic weights of the output layer

Approximation of Functions IV

•The theorem is an existence theorem: it does not tell us exactly what the number m1 is; it just says that it exists!
•The theorem states that a single hidden layer is sufficient for an MLP to compute a uniform ε approximation to a given training set represented by the set of inputs x_1, …, x_m0 and a desired output f(x_1, …, x_m0).
•The theorem does not say, however, that a single hidden layer is optimum in the sense of learning time, ease of implementation or generalisation.


Approximation of Functions V

•Empirical knowledge shows that the number of data pairs needed in order to achieve a given error level ε is:

N = O(W / ε)

where W is the total number of adjustable parameters of the model. There is mathematical support for this observation (but we will not analyse it further). For example, a model with W = 1,000 adjustable parameters and a target error level of ε = 0.1 would need on the order of N ≈ 10,000 examples.
•There is a “curse of dimensionality” when approximating functions in high-dimensional spaces.
•It is theoretically justified to use two hidden layers.


Generalisation

Def: A network generalises well when the input-output mapping computed by the network is correct (or nearly so) for test data never used in creating or training the network. It is assumed that the test data are drawn from the same population used to generate the training data.

•To achieve generalisation, we should try to approximate the true mechanism that generates the data, not the specific structure of the data. If we learn the specific structure of the data we have overfitting or overtraining.


Generalisation II

(Figure slide; the graphic is not preserved in this extraction.)


Generalisation III

•To achieve good generalisation we need:
• To have good data (see previous slides)
• To impose smoothness constraints on the function
• To add knowledge we have about the mechanism
• To reduce / constrain the model parameters:
  • Through cross-validation
  • Through regularisation (pruning, AIC, BIC, etc.)


Cross Validation

•In the cross-validation method for model selection we split the training data into two sets:
• Estimation set
• Validation set
•We train our model on the estimation set.
•We evaluate the performance on the validation set.
•We select the model which performs “best” on the validation set.


Cross Validation II

•There are variations of the method depending on the partition of the validation set. Typical variants are:
• Method of early stopping
• Leave k-out


Method of Early Stopping

•Apply the method of early stopping when the number of data pairs, N, is less than 30W (N < 30W), where W is the number of free parameters in the network.
•Assume that r is the fraction of the training set allocated to the estimation subset (the remaining fraction 1 - r is used for validation). It can be shown that the optimal value of this parameter is given by:

r_opt = 1 - (√(2W - 1) - 1) / (2(W - 1))

•The method works as follows:
• Train the network in the usual way using the data in the estimation set


Method of Early Stopping II

• After a period of estimation, the weights and bias levels of the MLP are all fixed and the network is operated in its forward mode only. The validation error is measured for each example in the validation subset.
• When the validation phase is completed, the estimation is resumed for another period (e.g. 10 epochs) and the process is repeated.
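
A sketch of the early-stopping schedule described above, together with the r_opt formula. The train_epochs and validation_error callables, the patience rule and the 10-epoch period are assumptions supplied for illustration:

    import math

    def optimal_estimation_fraction(W):
        # r_opt = 1 - (sqrt(2W - 1) - 1) / (2 (W - 1)): fraction kept for estimation
        return 1.0 - (math.sqrt(2 * W - 1) - 1) / (2 * (W - 1))

    def train_with_early_stopping(train_epochs, validation_error,
                                  max_epochs=1000, period=10, patience=5):
        best_err, best_state, bad_checks = float("inf"), None, 0
        for _ in range(0, max_epochs, period):
            state = train_epochs(period)       # estimation phase for one period
            err = validation_error(state)      # forward-mode pass over the validation set
            if err < best_err:
                best_err, best_state, bad_checks = err, state, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:     # validation error keeps rising: stop
                    break
        return best_state, best_err

    print(optimal_estimation_fraction(100))    # roughly 0.93 for W = 100 free parameters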


Leave k-out Validation

•We divide the set of available examples into K subsets.
•The model is trained on all the subsets except for one, and the validation error is measured by testing it on the subset left out.
•The procedure is repeated for a total of K trials, each time using a different subset for validation.
•The performance of the model is assessed by averaging the squared error under validation over all the trials of the experiment.
•There is a limiting case, K = N, in which case the method is called leave-one-out.
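
A short sketch of the K-fold (leave k-out) procedure just described; train_model and squared_error are placeholders for whatever estimator and error measure are being validated:

    import numpy as np

    def k_fold_validation(X, D, K, train_model, squared_error):
        # Split the N available examples into K subsets and rotate the one left out.
        folds = np.array_split(np.random.permutation(len(X)), K)
        errors = []
        for k in range(K):
            val_idx = folds[k]
            train_idx = np.concatenate([folds[i] for i in range(K) if i != k])
            model = train_model(X[train_idx], D[train_idx])
            errors.append(squared_error(model, X[val_idx], D[val_idx]))
        return np.mean(errors)   # average validation error over the K trials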


Leave k-out Validation II

•An example with K = 4 was shown on this slide; the partition diagram is not preserved in this extraction.


Network Pruning

•To solve real-world problems we need to reduce the free parameters of the model. We can achieve this objective in one of two ways:
• Network growing: in this case we start with a small MLP and then add a new neuron or a layer of hidden neurons only when we are unable to achieve the performance level we want
• Network pruning: in this case we start with a large MLP with adequate performance for the problem at hand, and then we prune it by weakening or eliminating certain weights in a principled manner


Network Pruning II

•Pruning can be implemented as a form of regularisation.


Regularisation

•In model selection we need to balance two needs:
• To achieve good performance, which usually leads to a complex model
• To keep the complexity of the model manageable, due to practical estimation difficulties and the overfitting phenomenon
•A principled approach to counterbalancing both needs is given by regularisation theory.
•In this theory we assume that the estimation of the model takes place using the usual cost function plus a second term which is called the complexity penalty:


Regularisation II

R(w) = E_s(w) + λ E_c(w)

where R is the total cost function, E_s is the standard performance measure, E_c is the complexity penalty and λ > 0 is a regularisation parameter.

•Typically one imposes smoothness constraints as the complexity term, i.e. we want to co-minimise the smoothing integral of the kth order:

E_c(w, k) = (1/2) ∫ || ∂^k F(x, w) / ∂x^k ||² μ(x) dx

where F(x, w) is the function performed by the model and μ(x) is some weighting function which determines


Regularisation III

the region of the input space where the function F(x, w) is required to be smooth.


Regularisation IV

•Other complexity penalty options include:
• Weight Decay:

E_c(w) = ||w||² = Σ_{i=1}^{W} w_i²

where W is the total number of all free parameters in the model
• Weight Elimination:

E_c(w) = Σ_{i=1}^{W} (w_i / w_0)² / (1 + (w_i / w_0)²)

where w_0 is a pre-assigned parameter
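
Both penalties, and the gradient terms they add to the weight update, are easy to write down; a sketch with an arbitrarily chosen w_0:

    import numpy as np

    def weight_decay_penalty(w):
        # E_c(w) = sum_i w_i^2 ; gradient 2 w_i
        return np.sum(w ** 2), 2.0 * w

    def weight_elimination_penalty(w, w0=1.0):
        # E_c(w) = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2)
        r2 = (w / w0) ** 2
        grad = (2.0 * w / w0 ** 2) / (1.0 + r2) ** 2
        return np.sum(r2 / (1.0 + r2)), grad

    # The regularised cost is R(w) = E_s(w) + lambda * E_c(w), so during training the
    # complexity gradient (scaled by lambda) is simply added to the performance gradient.
    w = np.array([0.1, -2.0, 0.5])
    print(weight_decay_penalty(w)[0], weight_elimination_penalty(w)[0])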


Regularisation V

•There are other methods which base the decision on which weights to eliminate on the Hessian, H.
•For example:
• The optimal brain damage procedure (OBD)
• The optimal brain surgeon procedure (OBS)
• In these procedures a weight w_i is eliminated when its saliency is small compared with the average squared error, i.e. when:

S_i ≪ E_av

where the saliency S_i is defined as:

S_i = w_i² / (2 [H^(-1)]_{i,i})
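
A sketch of the saliency computation used by OBS/OBD-style pruning; the Hessian is passed in directly here, whereas in practice it (or an approximation of its inverse) has to be estimated from the trained network:

    import numpy as np

    def obs_saliencies(w, H):
        # S_i = w_i^2 / (2 [H^-1]_{i,i})
        H_inv = np.linalg.inv(H)
        return w ** 2 / (2.0 * np.diag(H_inv))

    def prune_candidates(w, H, E_av):
        # candidate weights whose removal is expected to cost less than the current error
        S = obs_saliencies(w, H)
        return np.where(S < E_av)[0]          # indices of weights that may be eliminated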


Conjugate-Gradient Method

•The conjugate-gradient method is a 2nd-order optimisation method, i.e. we assume that we can approximate the cost function up to second degree by its Taylor series:

f(x) = (1/2) x^T A x - b^T x + c

where A and b are an appropriate matrix and vector and x is a W-by-1 vector.

•We can find the minimum point by solving the equation:

x* = A^(-1) b


Conjugate-Gradient Method II

•Given the matrix A, we say that a set of nonzero vectors s(0), …, s(W-1) is A-conjugate if the following condition holds:

s^T(n) A s(j) = 0,  for all n and j with n ≠ j

•If A is the identity matrix, conjugacy is the same as orthogonality.

•A-conjugate vectors are linearly independent.


Summary of the Conjugate-Gradient Method

1. Initialisation: Unless prior knowledge on the weight vector w is available, choose the initial value w(0) using a procedure similar to the ones used for the BP algorithm.
2. Computation:
  1. For w(0), use BP to compute the gradient vector g(0).
  2. Set s(0) = r(0) = -g(0).
  3. At time step n, use a line search to find η(n) that sufficiently minimises E_av(η), i.e. the cost function E_av expressed as a function of η for fixed values of w and s.


Summary of the Conjugate-Gradient Method II

  4. Test to determine whether the Euclidean norm of the residual r(n) has fallen below a specified value, that is, a small fraction of the initial value ||r(0)||.
  5. Update the weight vector:

     w(n+1) = w(n) + η(n) s(n)

  6. For w(n+1), use BP to compute the updated gradient vector g(n+1).
  7. Set r(n+1) = -g(n+1).
  8. Use the Polak-Ribiere formula to calculate β(n+1):

     β(n+1) = max{ r^T(n+1) [r(n+1) - r(n)] / (r^T(n) r(n)), 0 }


Summary of the Conjugate-Gradient Method III

  9. Update the direction vector:

     s(n+1) = r(n+1) + β(n+1) s(n)

  10. Set n = n+1 and go to step 3.
3. Stopping Criterion: Terminate the algorithm when the following condition is satisfied:

   ||r(n)|| ≤ ε ||r(0)||

where ε is a prescribed small number.
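
The procedure above, collected into a sketch for a generic cost function. The grid-based line search is a crude stand-in for the line minimisation of step 3, and the quadratic at the end is an assumed toy example, not something from the slides:

    import numpy as np

    def line_search(cost, w, s, etas=2.0 ** -np.arange(20)):
        # crude substitute for the line minimisation of E_av(eta): best value on a grid
        vals = [cost(w + eta * s) for eta in etas]
        best = int(np.argmin(vals))
        return etas[best] if vals[best] < cost(w) else 0.0

    def conjugate_gradient(cost, grad, w0, eps=1e-3, max_iter=500):
        w = np.asarray(w0, dtype=float)
        r = -grad(w)                                   # steps 2.1-2.2: s(0) = r(0) = -g(0)
        s = r.copy()
        r0_norm = np.linalg.norm(r)
        for _ in range(max_iter):
            eta = line_search(cost, w, s)              # step 2.3: line search for eta(n)
            w = w + eta * s                            # step 2.5: weight update
            r_new = -grad(w)                           # steps 2.6-2.7: new residual
            if np.linalg.norm(r_new) <= eps * r0_norm: # steps 2.4 / 3: stopping criterion
                return w
            beta = max(r_new @ (r_new - r) / (r @ r), 0.0)   # step 2.8: Polak-Ribiere
            s = r_new + beta * s                       # step 2.9: new direction
            r = r_new
        return w

    # usage on the quadratic f(x) = 0.5 x^T A x - b^T x (an assumed toy example)
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    x_min = conjugate_gradient(lambda x: 0.5 * x @ A @ x - b @ x,
                               lambda x: A @ x - b,
                               np.zeros(2))
    print(x_min, np.linalg.solve(A, b))                # both should be close to A^-1 b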


Advantages & Disadvantages

•MLP and BP is used in Cognitive and Computational


Contents Neuroscience modelling but still the algorithm does not
MLP Model have real neuro-physiological support
•The algorithm can be used to make encoding /
BP Algorithm
decoding and compression systems. Useful for data
Approxim. pre-processing operations
Model Selec. •The MLP with the BP algorithm is a universal
approximator of functions
BP & Opt.
•The algorithm is computationally efficient as it has
Conclusions O(W) complexity to the model parameters
•The algorithm has “local” robustness
•The convergence of the BP can be very slow,
especially in large problems, depending on the method

CS 476: Networks of Neural Computation, CSD, UOC, 2009


Advantages & Disadvantages II

•The BP algorithm suffers from the problem of local


Contents minima
MLP Model

BP Algorithm
Approxim.

Model Selec.
BP & Opt.

Conclusions

CS 476: Networks of Neural Computation, CSD, UOC, 2009


Types of problems

•The BP algorithm is used in a great variety of problems:
• Time series prediction
• Credit risk assessment
• Pattern recognition
• Speech processing
• Cognitive modelling
• Image processing
• Control
• Etc.
•BP is the standard algorithm against which all other NN algorithms are compared!
