Each connection between PEs has an associated scalar value, called a weight, which is adapted during learning.
The PEs in the MLP are composed of an adder followed by a smooth saturating nonlinearity of the sigmoid type (Fig. 20.3). The most common saturating nonlinearities are the logistic function and the hyperbolic tangent. The threshold is used in other nets. The importance of the MLP is that it is a universal mapper (implements arbitrary input/output maps) when the topology has at least two hidden layers and a sufficient number of PEs [Haykin, 1994]. Even MLPs with a single hidden layer are able to approximate continuous input/output maps. This means that we will rarely need to choose topologies with more than two hidden layers. But these are existence proofs, so the issue that we must solve as engineers is to choose how many layers and how many PEs in each layer are required to produce good results.
Many problems in engineering can be thought of in terms of a transformation of an input space, containing the input, to an output space where the desired response exists. For instance, dividing data into classes can be thought of as transforming the input into 0 and 1 responses that will code the classes [Bishop, 1995]. Likewise, identification of an unknown system can also be framed as a mapping (function approximation) from the input to the system output [Kung, 1993]. The MLP is highly recommended for these applications.
FIGURE 20.2 MLP with one hidden layer (d-k-m).
FIGURE 20.3 A PE and the most common nonlinearities.
Function of Each PE
Let us study briefly the function of a single PE with two inputs [Zurada, 1992]. If the nonlinearity is the threshold nonlinearity, we can immediately see that the output is simply 1 or -1. The surface that divides these subspaces is called a separation surface, and in this case it is a line with the equation
w_1 x_1 + w_2 x_2 + b = 0    (20.1)
i.e., the PE weights and the bias control the orientation and position of the separation line, respectively (Fig. 20.4). In many dimensions the separation surface becomes a hyperplane of dimension one less than the dimensionality of the input space. So, each PE creates a dichotomy in the input space. For smooth nonlinearities the separation surface is not crisp; it becomes fuzzy, but the same principles apply. In this case, the size of the weights controls the width of the fuzzy boundary (larger weights shrink the fuzzy boundary).
The perceptron input/output map is built from a juxtaposition of linear separation surfaces, so the perceptron gives zero classification error only for linearly separable classes (i.e., classes that can be exactly classified by hyperplanes).
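For illustration, the short Python sketch below evaluates a two-input PE with a logistic nonlinearity and reports on which side of the separation line of Eq. (20.1) each input falls; the weights, bias, and test points are arbitrary illustrative values, not taken from the text.

```python
import math

def logistic(net):
    """Logistic (sigmoid) nonlinearity: maps net into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def pe_output(x1, x2, w1, w2, b):
    """Two-input PE: weighted sum (adder) followed by the smooth nonlinearity."""
    net = w1 * x1 + w2 * x2 + b
    return logistic(net)

# Illustrative values: the separation line is w1*x1 + w2*x2 + b = 0  (Eq. 20.1).
w1, w2, b = 1.0, -2.0, 0.5

for x1, x2 in [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]:
    y = pe_output(x1, x2, w1, w2, b)
    side = "above" if y > 0.5 else "below"
    print(f"input ({x1}, {x2}) -> output {y:.3f} ({side} the separation line)")
```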
When one adds one layer to the perceptron, creating a one-hidden-layer MLP, the type of separation surfaces changes drastically. It can be shown that this learning machine is able to create "bumps" in the input space, i.e., an area of high response surrounded by low responses [Zurada, 1992]. The function of each PE is always the same, no matter whether the PE is part of a perceptron or an MLP. However, notice that the output layer in the MLP works with the result of the hidden layer activations, creating an embedding of functions and producing more complex separation surfaces. The one-hidden-layer MLP is able to produce nonlinear separation surfaces.
If one adds an extra layer (i.e., two hidden layers), the learning machine can now combine bumps at will, so it can be interpreted as a universal mapper, since there is evidence that any function can be approximated by localized bumps. One important aspect to remember is that changing a single weight in the MLP can drastically change the location of the separation surfaces; i.e., the MLP achieves the input/output map through the interplay of all its weights.
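To see how hidden PEs can combine into a bump, the sketch below (a made-up one-dimensional example, not taken from the text) subtracts two shifted logistic functions; the response is high only in a localized region and low elsewhere.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def bump(x, a=-1.0, b=1.0, gain=5.0):
    """Difference of two shifted sigmoids: high roughly for a < x < b, low outside."""
    return logistic(gain * (x - a)) - logistic(gain * (x - b))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x = {x:+.1f}   bump(x) = {bump(x):.3f}")
```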
How to Train MLPs
One fundamental issue is how to adapt the weights w_ij of the MLP to achieve a given input/output map. The weights are adapted by gradient descent, i.e., each weight is moved against the gradient of a cost function J,

w(n + 1) = w(n) - η∇J(n)    (20.2)

where η is the step size and n indexes the iteration. The cost is the mean square error between the desired response d and the system output y, summed over the training patterns,

J = (1/2) Σ_p ε_p²,   ε = d - y    (20.3)

For a linear PE, y = Σ_j w_j x_j, the chain rule gives

∂J/∂w_j = (∂J/∂y)(∂y/∂w_j) = -ε x_j    (20.4)

which, substituted in Eq. (20.2), yields the least mean square (LMS) rule of adaptive filtering. When the PE includes a smooth nonlinearity, y = f(net) with net = Σ_j w_j x_j, the derivative of the nonlinearity also enters the gradient,
∂J/∂w_j = (∂J/∂y)(∂y/∂net)(∂net/∂w_j) = -ε f′(net) x_j    (20.5)
where f′(net) is the derivative of the nonlinearity computed at the operating point. Equation (20.5) is known as the delta rule, and it will train the perceptron [Haykin, 1994]. Note that throughout the derivation we skipped the pattern index for simplicity, but this rule is applied for each input pattern. However, the delta rule cannot train MLPs since it requires the knowledge of the error signal at each PE.
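A minimal sketch of the delta rule for a single PE with a tanh nonlinearity, combining Eqs. (20.2) and (20.5); the step size, number of passes, and training patterns are illustrative assumptions.

```python
import math

def delta_rule_train(patterns, eta=0.1, epochs=100):
    """Train one tanh PE with the delta rule: w_j += eta * eps * f'(net) * x_j."""
    n_inputs = len(patterns[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, d in patterns:
            net = sum(wj * xj for wj, xj in zip(w, x)) + b
            y = math.tanh(net)
            eps = d - y                      # injected error
            fprime = 1.0 - y * y             # derivative of tanh at the operating point
            for j in range(n_inputs):
                w[j] += eta * eps * fprime * x[j]
            b += eta * eps * fprime          # bias treated as a weight with input 1
    return w, b

# Illustrative linearly separable patterns (inputs, desired response in {-1, +1}).
data = [([0.0, 0.0], -1.0), ([0.0, 1.0], -1.0), ([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0)]
w, b = delta_rule_train(data)
print("weights:", [round(v, 3) for v in w], "bias:", round(b, 3))
```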
The principle of ordered derivatives can be extended to multilayer networks, provided we organize the computations in flows of activation and error propagation. The principle is very easy to understand, but a little complex to formulate in equation form [Haykin, 1994].
Suppose that we want to adapt the weights connected to a hidden layer PE, the ith PE (Fig. 20.7). One can decompose the computation of the partial derivative of the cost with respect to the weight w_ij as
∂J/∂w_ij = (∂J/∂y_i) (∂y_i/∂net_i)(∂net_i/∂w_ij)    (20.6)
i.e., the partial derivative with respect to the weight is the product of the partial derivative with respect to the PE state - part 1 in Eq. (20.6) - times the partial derivative of the local activation with respect to the weights - part 2 in Eq. (20.6). This last quantity is exactly the same as for the nonlinear PE (f′(net_i)x_j), so the big issue is the computation of ∂J/∂y_i. For an output PE, ∂J/∂y_i becomes the injected error ε in Eq. (20.4). For the hidden ith PE, ∂J/∂y_i is evaluated by summing all the errors that reach the PE from the top layer through the topology when the injected errors ε_k are clamped at the top layer, or in an equation
∂J/∂y_i = Σ_k (∂J/∂y_k)(∂y_k/∂net_k)(∂net_k/∂y_i) = -Σ_k ε_k f′(net_k) w_ki    (20.7)
Substituting back in Eq. (20.6) we finally get
∂J/∂w_ij = -x_j f′(net_i) Σ_k ε_k f′(net_k) w_ki    (20.8)
FIGURE 20.7 How to adapt the weights connected to the ith PE.
This equation embodies the back-propagation training algorithm [Haykin, 1994; Bishop, 1995]. It can be rewritten as the product of a local activation (part 1) and a local error (part 2), exactly as the LMS and the delta rules. But now the local error is a composition of errors that flow through the topology, which becomes equivalent to the existence of a desired response at the PE.
There is an intrinsic flow in the implementation of the back-propagation algorithm: first, inputs are applied to the net and activations are computed everywhere to yield the output activation. Second, the external errors are computed by subtracting the net output from the desired response. Third, these external errors are utilized in Eq. (20.8) to compute the local errors for the layer immediately preceding the output layer, and the computations are chained up to the input layer. Once all the local errors are available, Eq. (20.2) can be used to update every weight. These three steps are then repeated for the other training patterns until the error is acceptable.
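To make the three steps concrete, here is a minimal sketch of on-line back-propagation for a one-hidden-layer MLP with tanh hidden PEs and a linear output PE; the 2-2-1 topology, step size, and XOR training patterns are illustrative assumptions rather than values from the chapter.

```python
import math
import random

def backprop_epoch(patterns, W1, b1, W2, b2, eta=0.1):
    """One pass of back-propagation over the training patterns (on-line updates)."""
    for x, d in patterns:
        # Step 1: forward pass -- compute all activations.
        net1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
        h = [math.tanh(n) for n in net1]                                    # hidden activations
        y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]  # linear outputs

        # Step 2: external (injected) errors at the output.
        eps = [dk - yk for dk, yk in zip(d, y)]

        # Step 3: local errors. The output PEs are linear, so their local error is eps;
        # hidden local errors sum the output errors flowing back through W2 (Eq. 20.8).
        delta_out = eps
        delta_hid = [(1.0 - h[i] ** 2) *
                     sum(W2[k][i] * delta_out[k] for k in range(len(W2)))
                     for i in range(len(h))]

        # Weight update (Eq. 20.2): local error times the activation feeding the weight.
        for k in range(len(W2)):
            for i in range(len(h)):
                W2[k][i] += eta * delta_out[k] * h[i]
            b2[k] += eta * delta_out[k]
        for i in range(len(W1)):
            for j in range(len(x)):
                W1[i][j] += eta * delta_hid[i] * x[j]
            b1[i] += eta * delta_hid[i]

# Illustrative 2-2-1 topology trained on XOR, a nonlinearly separable problem.
random.seed(1)
W1 = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [[random.uniform(-1.0, 1.0) for _ in range(2)]]
b2 = [0.0]
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for _ in range(2000):
    backprop_epoch(xor, W1, b1, W2, b2)
for x, d in xor:
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sum(w * hi for w, hi in zip(W2[0], h)) + b2[0]
    print(x, "->", round(y, 2), "(desired", d[0], ")")
```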
Step three is equivalent to injecting the external errors in the dual topology and back-propagating them up to the input layer [Haykin, 1994]. The dual topology is obtained from the original one by reversing the data flow and substituting summing junctions by splitting nodes and vice versa. The error at each PE of the dual topology is then multiplied by the activation of the original network to compute the weight updates. So, effectively the dual topology is being used to compute the local errors, which makes the procedure highly efficient. This is the reason back-propagation trains a network of N weights with a number of multiplications proportional to N (O(N)), instead of O(N²) for previous methods of computing partial derivatives known in control theory. Using the dual topology to implement back-propagation is the best and most general method to program the algorithm in a digital computer.
Applying Back-Propagation in Practice
Now that we know an algorithm to train MLPs, let us see what the practical issues are in applying it. We will address the following aspects: size of the training set vs. number of weights, search procedures, how to stop training, and how to set the topology for maximum generalization.
Size of Training Set
The size of the training set is very important for good performance. Remember that the ANN gets its information from the training set. If the training data do not cover the full range of operating conditions, the system may perform badly when deployed. Under no circumstances should the training set be smaller than the number of weights in the ANN. A good size for the training set is ten times the number of weights in the network, with the lower limit being set around three times the number of weights (these values should be taken as an indication, subject to experimentation for each case) [Haykin, 1994].
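As a quick worked example of this rule of thumb, assuming a fully connected one-hidden-layer MLP with d inputs, k hidden PEs, m outputs, and bias weights (the topology below is made up for illustration):

```python
def mlp_weight_count(d, k, m):
    """Number of free parameters of a fully connected d-k-m MLP, biases included."""
    return (d + 1) * k + (k + 1) * m

d, k, m = 8, 5, 2                       # illustrative topology
n_weights = mlp_weight_count(d, k, m)   # (8+1)*5 + (5+1)*2 = 57
print("weights:", n_weights)
print("lower limit (~3x weights):", 3 * n_weights)     # about 171 training patterns
print("recommended (~10x weights):", 10 * n_weights)   # about 570 training patterns
```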
Search Procedures
Searching along the direction of the gradient is fine if the performance surface is quadratic. However, in ANNs this is rarely the case, because of the use of nonlinear PEs and topologies with several layers. So, gradient descent can be caught in local minima, and the search can become very slow in regions of small curvature. One efficient way to speed up the search in regions of small curvature and, at the same time, to stabilize it in narrow valleys is to include a momentum term in the weight adaptation
w_ij(n + 1) = w_ij(n) + η δ_i(n) x_j(n) + α[w_ij(n) - w_ij(n - 1)]    (20.9)
The value of the momentum α should be set experimentally between 0.5 and 0.9. There are many more modifications to the conventional gradient search, such as adaptive step sizes, annealed noise, conjugate gradients, and second-order methods (using information contained in the Hessian matrix), but the simplicity and power of momentum learning are hard to beat [Haykin, 1994; Bishop, 1995].
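A one-line sketch of the momentum update of Eq. (20.9); the weight, error, and activation values below are placeholders chosen only to show the arithmetic.

```python
def momentum_update(w, w_prev, delta, x, eta=0.1, alpha=0.9):
    """Eq. (20.9): w(n+1) = w(n) + eta*delta*x + alpha*(w(n) - w(n-1))."""
    return w + eta * delta * x + alpha * (w - w_prev)

# One illustrative step for a single weight.
w_prev, w = 0.40, 0.45          # w(n-1), w(n)
delta, x = 0.2, 1.0             # local error and input activation at step n
w_next = momentum_update(w, w_prev, delta, x)
print(round(w_next, 4))         # 0.45 + 0.02 + 0.9*0.05 = 0.515
```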
How to Stop Training
The stop criterion is a fundamental aspect of training. The simple ideas of capping the number of iterations or of letting the system train until a predetermined error value is reached are not recommended. The reason is that we want the ANN to perform well on the test set data; i.e., we would like the system to perform well on data it
never saw before (good generalization) [Bishop, 1995]. The error in the training set tends to decrease with iteration when the ANN has enough degrees of freedom to represent the input/output map. However, the system may be remembering the training patterns (overfitting) instead of finding the underlying mapping rule. This is called overtraining. To avoid overtraining, the performance in a validation set, i.e., a set of input data that the system never saw before, must be checked regularly during training (e.g., once every 50 passes over the training set). The training should be stopped when the error in the validation set starts to increase, despite the fact that the error in the training set continues to decrease. This method is called cross validation. The validation set should be 10% of the training set, and distinct from it.
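A sketch of this stopping rule is given below; train_one_pass and mse stand for whatever training step and error measure are already available, and are assumptions of this sketch rather than routines defined in the chapter.

```python
import copy

def early_stopping(train_set, val_set, weights, train_one_pass, mse,
                   check_every=50, max_passes=10000):
    """Train until the validation-set error starts to increase (cross validation).

    train_one_pass(weights, data) is assumed to update the weights in place over
    one pass; mse(weights, data) is assumed to return the mean square error.
    """
    best_err = float("inf")
    best_weights = copy.deepcopy(weights)
    for n_pass in range(1, max_passes + 1):
        train_one_pass(weights, train_set)
        if n_pass % check_every == 0:
            err = mse(weights, val_set)       # data never used for weight updates
            if err > best_err:                # validation error rising: overtraining
                weights[:] = best_weights     # roll back to the best weights seen
                return n_pass
            best_err = err
            best_weights = copy.deepcopy(weights)
    return max_passes
```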
Size of the Topology
The size of the topology should also be carefully selected. If the number of layers or the size of each layer is too small, the network does not have enough degrees of freedom to classify the data or to approximate the function, and the performance suffers.
On the other hand, if the size of the network is too large, performance may also suffer. This is the phenomenon of overfitting that we mentioned above. But one alternative way to control it is to reduce the size of the network. There are basically two procedures to set the size of the network: either one starts small and adds new PEs, or one starts with a large network and prunes PEs [Haykin, 1994]. One quick way to prune the network is to impose a penalty term in the performance function - a regularizing term - such as limiting the slope of the input/output map [Bishop, 1995]. A regularization term that can be implemented locally is
w_ij(n + 1) = w_ij(n)(1 - λ) + η δ_i(n) x_j(n)    (20.10)
where λ is the weight decay parameter and δ the local error. Weight decay tends to drive unimportant weights to zero.
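A sketch of a weight update with local weight decay in the form given above; the step size and decay parameter are illustrative. With no error the correction vanishes and the weight simply shrinks toward zero.

```python
def weight_decay_update(w, delta, x, eta=0.1, lam=0.001):
    """Local weight decay: shrink the weight by (1 - lambda), then add the correction."""
    return w * (1.0 - lam) + eta * delta * x

w = 0.45
for step in range(3):
    # With zero local error the weight only decays toward zero.
    w = weight_decay_update(w, delta=0.0, x=1.0)
    print(round(w, 6))
```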
A Posteriori Probabilities
We will finish the discussion of the MLP by noting that this topology, when trained with the mean square error, is able to estimate directly at its outputs a posteriori probabilities, i.e., the probability that a given input pattern belongs to a given class [Bishop, 1995]. This property is very useful because the MLP outputs can be interpreted as probabilities and operated on as numbers. In order to guarantee this property, one has to make sure that each class is attributed to one output PE, that the topology is sufficiently large to represent the mapping, that the training has converged to the absolute minimum, and that the outputs are normalized between 0 and 1. The first requirements are met by good design, while the last can be easily enforced if the softmax activation is used as the output PE [Bishop, 1995],
y_i = exp(net_i) / Σ_k exp(net_k)    (20.11)
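A small sketch of the softmax activation of Eq. (20.11); the subtraction of the maximum is a common numerical-stability detail, not part of the equation itself.

```python
import math

def softmax(nets):
    """Eq. (20.11): y_i = exp(net_i) / sum_k exp(net_k); outputs lie in (0,1) and sum to 1."""
    m = max(nets)                               # subtract the max for numerical stability
    exps = [math.exp(n - m) for n in nets]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, -1.0])])   # e.g. [0.705, 0.259, 0.035]
```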
20.3 Radial Basis Function Networks
The radial basis function (RBF) network constitutes another way of implementing arbitrary input/output mappings. The most significant difference between the MLP and the RBF lies in the PE nonlinearity. While the PE in the MLP responds to the full input space, the PE in the RBF is local, normally a Gaussian kernel in the input space. Hence, it only responds to inputs that are close to its center; i.e., it basically has a local response.
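The local response of an RBF PE can be illustrated with a Gaussian kernel; the center and width below are arbitrary illustrative values. The response is near 1 at the center and falls quickly toward 0 away from it.

```python
import math

def gaussian_pe(x, center, sigma):
    """Gaussian kernel PE: responds only to inputs close to its center."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

center, sigma = (0.0, 0.0), 1.0
for x in [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0)]:
    print(x, "->", round(gaussian_pe(x, center, sigma), 4))
```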
The RBF network is also a layered net with the hidden layer built from Gaussian kernels and a linear (or nonlinear) output layer (Fig. 20.8). Training of the RBF network is normally done in two stages [Haykin, 1994]: first, the centers x_i are adaptively placed in the input space using competitive learning or k-means clustering [Bishop, 1995], which are unsupervised procedures. Competitive learning is explained later in the chapter. The variance of each Gaussian is chosen as a percentage (30 to 50%) of the distance to the nearest center. The goal is to cover adequately the input data distribution. Once the RBF is located, the second layer weights w