Multi-Layer Feed-Forward Neural Networks (FNN)
We consider a more general network architecture: between the input and output
layers there are hidden layers, as illustrated below.
Hidden nodes neither receive inputs directly from, nor send outputs directly to, the external environment.
FNNs overcome the limitation of single-layer NN: they can handle non-linearly
separable learning tasks.
[Figure: a feed-forward network with an input layer, a hidden layer and an output layer.]
FNN
XOR problem
A typical example of a non-linearly separable function is the XOR. This function takes two input arguments with values in {-1, 1} and returns one output in {-1, 1}, as specified in the following table:

x1   x2   x1 XOR x2
-1   -1      -1
-1    1       1
 1   -1       1
 1    1      -1
If we think of -1 and 1 as encodings of the truth values false and true, respectively, then XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.
FNN
XOR problem
In the graph of the XOR function, the input pairs giving output 1 and those giving output -1 are plotted in the (x1, x2) plane. These two classes cannot be separated using a single line; two lines are needed.

The following NN with two hidden nodes realizes this non-linear separation, where each hidden node describes one of the two lines. This NN uses the sign activation function; the output node is used to combine the outputs of the two hidden nodes.

[Figure: the XOR points in the (x1, x2) plane and the network with inputs x1, x2, two hidden nodes and one output node; the two arrows indicate the regions where the network output will be 1.]
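A minimal numerical sketch of this construction. The weights below are illustrative choices that realize XOR with sign units; they are not necessarily the values shown in the original figure:

```python
import numpy as np

def sign(v):
    # Sign activation returning values in {-1, +1}
    return np.where(v >= 0, 1.0, -1.0)

def xor_net(x1, x2):
    # Hidden node 1: fires (+1) when at least one input is +1 (one separating line)
    h1 = sign(x1 + x2 + 1.5)
    # Hidden node 2: fires (+1) only when both inputs are +1 (the other separating line)
    h2 = sign(x1 + x2 - 1.5)
    # Output node combines the two half-planes: +1 iff h1 = +1 and h2 = -1
    return sign(h1 - h2 - 1.5)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints -1, 1, 1, -1
```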
FNN
Types of decision regions
A network with a single node (inputs x1, x2, weights w1, w2 and bias w0) divides the input plane into two half-plane decision regions:

w0 + w1 x1 + w2 x2 > 0
w0 + w1 x1 + w2 x2 < 0
Increasing the slope parameter a makes the (sigmoid) activation function steeper, where v_j = ∑_i w_ji y_i is the input of neuron j.
FNN
Training: Backprop algorithm
• The Backprop algorithm searches for weight values
that minimize the total error of the network over the set
of training examples (training set).
• Backprop consists of the repeated application of the
following two passes:
– Forward pass: in this step the network is activated on one
example and the error of (each neuron of) the output layer is
computed.
– Backward pass: in this step the network error is used for updating the weights (credit assignment problem). This process is more complex than the LMS algorithm for the Adaline, because hidden nodes are not directly linked to the error, but only through the nodes of the next layer. Therefore, starting at the output layer, the error is propagated backwards through the network, layer by layer, by recursively computing the local gradient of each neuron.
Backprop FNN
Network activation (forward step) and error propagation (backward step)
[Figure: in the forward step, activations flow from neuron i to neuron k through the weight w_ki; in the backward step, the error flows in the opposite direction.]
FNN
Total Mean Squared Error
• The error of output neuron j after the activation of the network on the n-th training example (x(n), d(n)) is:

e_j(n) = d_j(n) − y_j(n)

• The network error on the n-th example is the sum over the output neurons:

E(n) = ½ ∑_{j output node} e_j(n)²

• The total mean squared error is the average of the network errors over the training examples:

E_AV = (1/N) ∑_{n=1}^{N} E(n)
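A small numerical sketch of these two formulas. The arrays d and y are illustrative values holding the desired and actual outputs of the output neurons for N = 4 examples:

```python
import numpy as np

# Illustrative data: N = 4 examples, 2 output neurons
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])  # desired outputs d_j(n)
y = np.array([[0.9, 0.2], [0.1, 0.7], [0.8, 0.9], [0.3, 0.1]])  # network outputs y_j(n)

e = d - y                            # e_j(n) = d_j(n) - y_j(n)
E_n = 0.5 * np.sum(e**2, axis=1)     # E(n) = 1/2 * sum_j e_j(n)^2, one value per example
E_av = np.mean(E_n)                  # E_AV = (1/N) * sum_n E(n)
print(E_n, E_av)
```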
The weights are adjusted by gradient descent on the error:

w_ji = w_ji + ∆w_ji,   with   ∆w_ji = −η ∂E/∂w_ji,   η > 0
FNN
Weight Update Rule
The input of neuron j is v_j = ∑_{i=0,...,m} w_ji y_i.

Writing the error signal (local gradient) of neuron j as δ_j = −∂E/∂v_j, from ∂v_j/∂w_ji = y_i we get

∆w_ji = η δ_j y_i
FNN
Weight update of output neuron
In order to compute the weight change ∆w ji we need to know the error signal
δ j of neuron j .
There are two cases, depending on whether j is an output or a hidden neuron.
If j is an output neuron then using the chain rule we obtain:
δ_j = −∂E/∂v_j = −(∂E/∂e_j)(∂e_j/∂y_j)(∂y_j/∂v_j) = −e_j·(−1)·φ'(v_j) = e_j φ'(v_j)

because e_j = d_j − y_j and y_j = φ(v_j). Therefore

∆w_ji = η (d_j − y_j) φ'(v_j) y_i
FNN
Weight update of hidden neuron
If j is a hidden neuron then its error signal δ j is computed using the
error signals of all the neurons of the next layer.
Using the chain rule we have:

δ_j = −∂E/∂v_j = −(∂E/∂y_j)(∂y_j/∂v_j)

Observe that ∂y_j/∂v_j = φ'(v_j) and

∂E/∂y_j = ∑_{k in next layer} (∂E/∂v_k)(∂v_k/∂y_j) = −∑_{k in next layer} δ_k w_kj

Then

δ_j = (∑_{k in next layer} δ_k w_kj) · φ'(v_j)
FNN
Summary: Delta Rule
• Delta rule: ∆w_ji = η δ_j y_i, where
– δ_j = e_j φ'(v_j) if j is an output neuron,
– δ_j = φ'(v_j) ∑_{k in next layer} δ_k w_kj if j is a hidden neuron,
and, for the logistic sigmoid with slope a, φ'(v_j) = a y_j (1 − y_j).
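The following sketch puts the previous formulas together for a network with one hidden layer and logistic sigmoid units with slope a, trained in incremental mode. The architecture (2-2-1), the data encoding, the learning rate and the number of epochs are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
a, eta = 1.0, 0.5                            # sigmoid slope and learning rate (assumed)

def phi(v):                                  # logistic sigmoid with slope a
    return 1.0 / (1.0 + np.exp(-a * v))

# XOR encoded with 0/1 targets so they lie in the sigmoid's output range
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

n_in, n_hid, n_out = 2, 2, 1
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))    # hidden weights (last column = bias)
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))   # output weights (last column = bias)

for epoch in range(10000):
    for x, d in zip(X, D):
        # Forward pass
        y0 = np.append(x, 1.0)               # inputs plus bias input
        y1 = np.append(phi(W1 @ y0), 1.0)    # hidden outputs plus bias input
        y2 = phi(W2 @ y1)                    # network outputs
        # Backward pass
        e = d - y2                           # e_j = d_j - y_j
        delta2 = e * a * y2 * (1 - y2)       # output: delta_j = e_j * phi'(v_j)
        # hidden: delta_j = phi'(v_j) * sum_k delta_k * w_kj (bias column excluded)
        delta1 = a * y1[:-1] * (1 - y1[:-1]) * (W2[:, :-1].T @ delta2)
        # Delta rule: w_ji = w_ji + eta * delta_j * y_i
        W2 += eta * np.outer(delta2, y1)
        W1 += eta * np.outer(delta1, y0)

for x in X:
    y1 = np.append(phi(W1 @ np.append(x, 1.0)), 1.0)
    print(x, phi(W2 @ y1))                   # typically close to 0, 1, 1, 0
```

With only two hidden units, training XOR can occasionally get stuck in a local minimum; re-running with a different random initialization (or adding the momentum term of the next slide) usually fixes this.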
FNN
Generalized delta rule
• If η is small then the algorithm learns the weights very slowly, while if η is large then the large changes of the weights may cause unstable behavior, with oscillations of the weight values.
• A technique for tackling this problem is the introduction of a momentum term in the delta rule, which takes into account previous updates. We obtain the following generalized delta rule:

∆w_ji(n) = α ∆w_ji(n−1) + η δ_j(n) y_i(n),   0 ≤ α < 1

where α is the momentum constant.
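A minimal sketch of how the momentum term enters the update. The numeric values and variable names are illustrative:

```python
import numpy as np

eta, alpha = 0.1, 0.9                   # learning rate and momentum constant (assumed)
w = np.zeros(3)                         # weights w_ji of one neuron
dw_prev = np.zeros_like(w)              # previous update Delta w_ji(n-1)

def momentum_update(delta_j, y, w, dw_prev):
    # Generalized delta rule: Delta w(n) = alpha*Delta w(n-1) + eta*delta_j*y
    dw = alpha * dw_prev + eta * delta_j * y
    return w + dw, dw

w, dw_prev = momentum_update(0.2, np.array([1.0, 0.5, -1.0]), w, dw_prev)
```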
FNN
Other techniques: η adaptation
FNN
Batch and incremental training
• In batch mode the weights are updated only once per epoch, after all the training examples have been presented, according to the formula

w_ji = w_ji + ∑_{x training example} ∆w_ji(x)
• The learning process continues on an epoch-
by-epoch basis until the stopping condition is
satisfied.
• In incremental mode, choose a randomized ordering for selecting the examples of the training set at each epoch, in order to avoid poor performance (see the sketch below).
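A sketch of one epoch in the two modes. Here delta_w is a hypothetical stand-in for the per-example weight change η δ_j y_i of the previous slides; the data and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng()
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy training set
w = np.zeros(2)

def delta_w(w, x):
    # Stand-in for the per-example weight change computed by Backprop
    return 0.1 * (x - w)

for epoch in range(5):
    # Incremental mode: random presentation order, weights updated after each example
    for n in rng.permutation(len(X)):
        w = w + delta_w(w, X[n])
    # Batch mode would instead accumulate the changes and apply them once per epoch:
    # w = w + sum(delta_w(w, x) for x in X)
```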
FNN
Stopping criteria
• Sensible stopping criteria:
– total mean squared error change:
Backprop is considered to have converged when the absolute rate of change of the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]); a check of this kind is sketched after this list.
– generalization based criterion:
After each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, then the part of the training set used for testing the network's generalization must not be used for updating the weights.
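A minimal sketch of the first criterion, assuming an illustrative threshold and error history:

```python
def converged(error_history, threshold=0.01):
    # Stop when the absolute change of the average squared error
    # between two consecutive epochs is sufficiently small.
    if len(error_history) < 2:
        return False
    return abs(error_history[-1] - error_history[-2]) < threshold

print(converged([0.90, 0.40, 0.15, 0.145]))  # True: last change is 0.005 < 0.01
```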
FNN
NN DESIGN
The following features are very important for
NN design:
• Data representation
• Network Topology
• Network Parameters
• Training
• Validation
FNN
Data Representation
• Data representation depends on the problem;
generally NNs work on continuous (real valued)
attributes.
• Attributes of different types may have different
ranges of values; this can affect the training
process. Normalization may be used, so that each
attribute assumes values between 0 and 1.
x_i = (x_i − min_i) / (max_i − min_i)
where min i and max i represent the range of that
attribute over the training set.
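A sketch of this min-max normalization, assuming the training set is stored with one attribute per column:

```python
import numpy as np

X_train = np.array([[2.0, 100.0],
                    [4.0, 250.0],
                    [6.0, 400.0]])           # illustrative training set

mins = X_train.min(axis=0)                   # min_i over the training set
maxs = X_train.max(axis=0)                   # max_i over the training set
X_norm = (X_train - mins) / (maxs - mins)    # each attribute now lies in [0, 1]

# New data must be scaled with the same min_i and max_i computed on the training set
x_new = np.array([5.0, 175.0])
x_new_norm = (x_new - mins) / (maxs - mins)
```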
FNN
Network Topology
• The number of layers and of neurons
depend on the specific task. In practice this
issue is solved by trial and error.
• Two types of adaptive algorithms can be
used:
– start from a large network and successively
remove some neurons and links until network
performance degrades (pruning).
– begin with a small network and introduce new
neurons until performance is satisfactory.
FNN
Network parameters
FNN
Weights and learning rate
• In general, initial weights are randomly
chosen, with typical values between -1.0
and 1.0 or -0.5 and 0.5.
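A minimal sketch of this initialization for one weight matrix; the layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng()
n_inputs, n_hidden = 4, 3

# Initial weights drawn uniformly from a small symmetric interval, e.g. [-0.5, 0.5]
W = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs + 1))   # +1 column for the bias
```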
FNN
Training
• Rule of thumb:
– the number of training examples should be at
least four to ten times the number of weights of
the network.
• Other rule:
FNN
Expressive power
Boolean functions:
• Every boolean function can be represented
by a network with a single hidden layer
Continuous functions:
• Every bounded piece-wise continuous
function can be approximated with arbitrarily
small error by a network with one hidden
layer.
• Any continuous function can be
approximated to arbitrary accuracy by a
network with two hidden layers.
i.e.   |F(x_1, …, x_m0) − f(x_1, …, x_m0)| < ε
Rossella Cancelliere 13
NN 4 11-00
F(x_1, …, x_m0) represents the output of an MLP with:
FNN
Approximation by FNN - comments