
Multilayer Perceptron Design Algorithm

Elizabeth Wilson
Raytheon Company, Equipment Division
1001 Boston Post Road
Marlboro, MA 01752
bwilson@sud.2.ed.ray.com
Phone: (508) 490-1769
Fax: (508) 490-3007

Donald W. Tufts
Kelley Annex, EE Department
University of Rhode Island
Kingston, RI 02881
tufts@ele.uri.edu

May 1994

Abstract

This paper describes a design algorithm that has been developed to calculate the number of hidden nodes required and compute a good set of starting weights for the Multilayer Perceptron (MLP). There are significant advantages to being able to calculate the number of hidden nodes required. The proper choice of the number of hidden nodes results in shorter training times, better generalization, and simpler computations in implementation.

This method is then used to design an efficient, effective MLP for multiple-class decision using these simplified binary-decision neural networks. The resulting algorithmic structure has an efficient pipelined implementation. Simulations describe the application of the design algorithm and parametric classification of a transient signal. A modified wavelet feature representation is introduced as an input to the neural networks associated with arrival time discrimination.

1 Introduction
Neural networks have been used to solve a number of difficult problems, and these networks are particularly useful when the statistics associated with a mapping of received inputs to a desired output are not completely known and/or are nonlinear. The Multilayer Perceptron (MLP) yields a simple feedforward network that accomplishes this mapping.

Although a popular method for training the MLP, the backpropagation algorithm [1] is often criticized for the length of time it takes to converge (if it converges) and the potential for settling into a local instead of global minimum. There have been a number of techniques proposed to improve the backpropagation algorithm by optimizing parameters, speeding up the gradient descent, pruning unnecessary weights, and using clustering algorithms to define the structure.



The general mapping formulas in the literature typically lead to more nodes than necessary and longer training times. The structure of the network is critical to successful implementation, but often the number of hidden nodes in an MLP is chosen arbitrarily and modified by trial and error. Too few nodes do not properly characterize the mapping and make convergence difficult. Too many nodes will improve performance on the training set but will reduce the ability of the network to generalize to new examples [2]. The design method described here computes the number of hidden nodes required and the starting weights.

2. MLP Design Algorithm

2.1 Using Singular Value Decomposition (SVD) for Design


Let us consider the transformation from an input vector to the set of the hidden-layer node outputs as an approximate projection from the input space onto a subspace. From this point of view one suspects that the SVD can provide some insight into a better starting point for weights in Multilayer Perceptron learning algorithms. Reference [3] describes the motivation for using the SVD and the development of a test statistic [4] to consider the binary classification of a signal vector between two subspaces in the presence of white Gaussian noise. The hypotheses are:

$H_1$: Signal present in $S$: $S + N$ (signal subspace)

$H_0$: Noise only in $A$: $N$ (alternate subspace)

The Generalized Likelihood Ratio Test can be written as

$$\underbrace{\frac{\text{energy in } S}{\text{energy in } A}}_{\text{test statistic}} \;=\; \frac{\|P_S\,x\|^2}{\|P_A\,x\|^2} \;\gtrless\; \text{threshold}$$

where $P_S$ is the projection operator for $S$, $P_A$ is the projection operator for $A$, and $x$ is the received vector.

This test statistic is characterized in terms of the covariance matrices corresponding to the signal subspace (represented by $R_S$) and the alternative subspace ($R_A$), and the likelihood ratio test is written as

$$\text{test statistic} \;\gtrless\; \text{threshold}$$

We use the following SVD of the covariance matrix product:

$$R_A^{-1} R_S = U \Sigma V^T$$

where $U$ is a unitary matrix and $\Sigma$ is the diagonal matrix of singular values of $R_A^{-1} R_S$. The test statistic is then written as the sum

$$\sum_{j=1}^{N} d_j \,\bigl(u_j^T x\bigr)^2$$

where the $d_j$ are the diagonal elements of $\Sigma$ and $u_j^T$ is the $j$th column of $U^T$. This representation of the GLRT provides a useful tool for designing and training the MLP and computing a starting point for the weights.
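As a concrete illustration, the following numpy sketch evaluates this statistic. The quadratic form $(u_j^T x)^2$ follows the Gaussian GLRT reading above, and the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def glrt_statistic(x, U, d, N):
    """Evaluate the test statistic sum_{j=1}^N d_j (u_j^T x)^2 (a sketch;
    the squared projection is assumed from the Gaussian GLRT context).

    x : length-K received vector.
    U : K x K matrix of left singular vectors of R_A^{-1} R_S.
    d : singular values (the diagonal of Sigma), largest first.
    N : number of retained components (the hidden-layer size).
    """
    proj = U[:, :N].T @ x               # u_j^T x for j = 1, ..., N
    return float(np.sum(d[:N] * proj ** 2))
```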

2.2 Calculating the Number of Hidden Nodes

If the input-to-hidden-layer weights are viewed as approximately performing the same projection operation, the same covariance matrices can be used:

$$R_S = E[SS^T], \qquad R_A = E[AA^T]$$

where $S$ is a matrix composed of only examples from the $H_1$ class and $A$ is a matrix composed of the $H_0$ class patterns.

The SVD of the correlation matrix product

$$R_A^{-1} R_S = U \Sigma V^T \qquad (6)$$

yields a diagonal matrix $\Sigma$. Modeling the falloff of the singular values as a simple exponential curve allows us to determine a representative time constant by taking the natural log and rearranging terms. This value is approximately the number of hidden nodes ($N$) required. Regardless of the technique used, the rank $N$ of $\Sigma$ can be taken to be the necessary number of hidden nodes. The rank $N$ is less than $K$, where $K$ is the dimension of the input (and the size of the $R$ square matrices).
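As a sketch of this procedure, assuming `S` and `A` hold zero-mean example vectors as columns and using an illustrative rank threshold `tol` (the exponential-falloff fit described above is an alternative), the hidden-layer size can be estimated as follows:

```python
import numpy as np

def estimate_hidden_nodes(S, A, tol=1e-3):
    """Estimate the number of hidden nodes N from the SVD of R_A^{-1} R_S.

    S : K x Ms matrix of H1 (signal-class) examples, one per column.
    A : K x Ma matrix of H0 (noise-class) examples, one per column.
    Returns N together with U and the singular values d for reuse.
    """
    R_S = S @ S.T / S.shape[1]          # sample estimate of E[S S^T]
    R_A = A @ A.T / A.shape[1]          # sample estimate of E[A A^T]

    # SVD of the covariance matrix product (Equation 6); a linear solve
    # avoids forming the explicit inverse of R_A.
    U, d, Vt = np.linalg.svd(np.linalg.solve(R_A, R_S))

    # Take N as the effective rank of Sigma: the number of singular
    # values still significant relative to the largest one.
    N = int(np.sum(d > tol * d[0]))
    return N, U, d
```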

This procedure has been successful for a class of problems where there is a single output node. For the binary hypothesis test, the form of the likelihood ratio shows that only this one output node is needed. With the input layer defined by the input data, the hidden nodes calculated above, and the specification of one output, the architecture is now defined.

Figure 1: Improved Performance Using Design Algorithm

2.3 Computing the Starting Weights

If only the first $N$ diagonal elements of $\Sigma$ are used, then only the first $N$ columns of $U$ are relevant. Therefore, the matrix $U$ in the SVD result (6) directly reveals the starting values for the input-to-hidden weights. Because $U$ is a $K \times K$ square matrix, taking the first $N$ columns results in the weight matrix $W^{(IH)}$, which is $K \times N$ as required:

$$W^{(IH)} = [\,u_1 \;\; u_2 \;\cdots\; u_N\,] \qquad (7)$$

The likelihood test statistic can be used to define starting weights from the hidden layer to the one output node. Using the diagonal elements of $\Sigma$, these starting values are

$$W^{(HO)} = [\,d_1 \;\; d_2 \;\cdots\; d_N\,]^T \qquad (8)$$

which produces a matrix $W^{(HO)}$ that is $N \times 1$ as desired.
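Continuing the sketch above, the starting weights then follow directly from $U$ and the singular values (again a sketch of Equations 7 and 8, not the authors' code):

```python
def starting_weights(U, d, N):
    """Starting weights per Equations 7 and 8 (illustrative sketch).

    W_IH : K x N, the first N columns of U (input-to-hidden weights).
    W_HO : N x 1, the first N singular values (hidden-to-output weights).
    """
    W_IH = U[:, :N]                     # first N columns of U (Equation 7)
    W_HO = d[:N].reshape(-1, 1)         # diagonal elements of Sigma (Equation 8)
    return W_IH, W_HO
```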


Figure 2: Multistage Implementation for Transient Frequency Feature Extraction

3. Transient Binary Classification Example

For the first binary classification example [5,6] we test for the presence of a signal component in a prescribed cell of the time-frequency sub-region.

The training set $X$ contains examples of the time-frequency phase sampling and is separated into an $S$ matrix comprised of the $H_1$ cases and an $A$ matrix with the $H_0$ examples. Computing the covariance matrix product, performing the SVD described in Equation 6, and carrying out the calculations described in the algorithm yields a value of 4 for the number of hidden nodes. The first 4 columns of $U$ are utilized as the starting weights from the input to hidden layers. The first 4 singular values are converted as shown in Equation 8 to the starting weights from the hidden layer to the output node.
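In terms of the earlier sketches, this step would amount to the following, where `S` and `A` are the class matrices drawn from the training set $X$ (the value $N = 4$ is the paper's reported result for this data, not something the sketch guarantees):

```python
N, U, d = estimate_hidden_nodes(S, A)    # the paper reports N = 4 here
W_IH, W_HO = starting_weights(U, d, N)   # 4 columns of U; 4 singular values
```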

The same problem was addressed with the original training set and random weights. A network of the same size would not converge using the PDP software package [7]. Weights were added and eventually another layer was added. After extensive attempts at training an MLP, the best performance was realized with a network with two hidden layers (6 nodes in the first hidden layer and 4 nodes in the second). Using fewer layers with fewer nodes in each layer, the designed network yielded better performance.

Figure 3: Multistage Implementation for t=1 and f=7 Example

The test blocks are applied to both networks and the false alarm, miss, correct $H_0$, and correct $H_1$ rates are computed. Figure 1 shows the resulting ROC curves plotting $P_{FA}$ versus $P_D$ for the random start case (dashed line) and the design algorithm case (solid line). The performance has been improved significantly and the network now has fewer hidden nodes.

4 Extension to Multiple Classes

A multistage structure is constructed to cascade binary neural networks [8]. This allows the use of smaller and simpler networks, which provide for more efficient training and implementation. The pipeline architecture with parallel computations is conducive to a VLSI implementation. The design goal is to be able to train with examples that only contain one signal, but use the architecture to resolve the components of multiple signals.

The coarse stage is responsible for detecting the presence of a signal in a relatively large portion of the Time-Frequency space and transforming the input data into a more desirable form for further processing. This transformation includes the calculation of the various wavelet representations of the input signal at successive resolutions and their corresponding spectral components.

In the fine stages, only those networks associated with a detection in the coarse stage are implemented. If multiple signals are identified at the coarse stage, all of the necessary regions will be processed further. If a coarse region is classified as having no signal, no further operations will be performed on that area.
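A minimal control-flow sketch of this coarse-to-fine dispatch is shown below; `coarse_net`, `fine_nets`, and `next_resolution` are hypothetical stand-ins for the trained binary networks and the wavelet transformation, not interfaces from the paper:

```python
def multistage_classify(x, coarse_net, fine_nets, next_resolution):
    """Run the coarse stage, then only those fine-stage networks whose
    coarse region registered a detection (illustrative sketch).

    coarse_net      : callable returning indices of coarse regions in
                      which a signal was detected.
    fine_nets       : dict mapping region index -> binary MLP callable.
    next_resolution : callable reducing x to the fine-stage feature
                      vector for a region (e.g., the next wavelet
                      resolution after decimation).
    """
    results = {}
    for region in coarse_net(x):         # undetected regions are skipped entirely
        features = next_resolution(x, region)
        results[region] = fine_nets[region](features)
    return results
```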


Figure 4: One Signal Performance

For both the time and frequency decisions, quadrature mirror filters are used to represent the signal in the form of the Mallat wavelet transform implementation [9]. This technique has been chosen as a means of reducing the input feature size (and therefore the network sizes) from stage to stage. The filtering also removes many of the unwanted regions of the time-frequency plane, making the network training and implementation easier and allowing for the analysis of multiple signals with the same networks that were trained with examples of one signal. In addition, for the arrival time discrimination, the modified wavelet representation makes use of the property that the onset of the transient will line up across different scales [10,11]. The development of the modified wavelet feature vector is described in reference [3].
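For illustration, one stage of a two-channel QMF split with decimation might look like the following generic sketch; the mirror construction is a standard textbook choice, and the filter `h` stands in for whatever analysis filter is used, not the authors' specific design:

```python
import numpy as np

def qmf_stage(x, h):
    """One stage of a two-channel quadrature-mirror-filter split,
    decimated by 2, in the spirit of the Mallat decomposition [9].

    x : 1-D input signal.
    h : lowpass analysis filter taps.
    Returns the approximation (next resolution) and detail bands.
    """
    h = np.asarray(h, dtype=float)
    g = h[::-1].copy()
    g[1::2] = -g[1::2]                  # mirror highpass: g[n] = (-1)^n h[L-1-n]
    low = np.convolve(x, h)[::2]        # next-resolution approximation
    high = np.convolve(x, g)[::2]       # detail band
    return low, high
```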

The implementation of the multistage architecture for the frequency feature extraction is shown in Figure 2. The first stage splits the frequency band into two parts based on the network computation of the spectrum of the input signal. If a signal component is detected, the next stage uses the next resolution as its input features. Because of the decimation at each stage of the wavelet packet generation, the networks at each stage are smaller. Each network is trained separately, but the smaller networks are easier to train. Also, the smaller networks use fewer weights and therefore simplify the computation during implementation. An example of the multistage implementation for t = 1T and f = 7F is shown in Figure 3. Random seeds are chosen to ensure that the test examples are different from the training examples. Test results for two of the 64 cells are shown in Figure 4.

There the network is shown to generalize for different signal-to-noise ratios. The percent correct results for the two cases described at 21 dB are plotted for 24, 18, 15, 12, and 9 dB SNR levels. The dotted lines represent the 95% confidence intervals for these measurements. The performance degrades as expected at lower SNR levels, but as in the simple case the neural networks are only presented with examples having 21 dB SNR and generalize for the other cases.

References

[1] D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press: Cambridge, MA, 1989.

[2] E. Levin, N. Tishby, and S.A. Solla, "A Statistical Approach to Learning and Generalization in Layered Neural Networks," Proceedings of the IEEE, Vol. 78, No. 10, October 1990, pp. 1568-1574.

[3] E. Wilson and D.W. Tufts, "Neural Network Design Algorithm and Multistage Structure," submitted to IEEE Transactions on Neural Networks, August 1993.

[4] R.N. McDonough, "A Canonical Form of the Likelihood Detector for Gaussian Random Vectors," Journal of the Acoustical Society of America, Vol. 49, 1971, pp. 402-406.

[5] E. Wilson, S. Umesh, and D.W. Tufts, "Resolving the Components of Transient Signals Using Neural Network and Subspace Inhibition Filter Algorithm," Proceedings of the IEEE International Joint Conference on Neural Networks, Vol. 4, 1992, pp. 283-288.

[6] E. Wilson, S. Umesh, and D.W. Tufts, "Designing a Neural Network Structure for Transient Detection Using the Subspace Inhibition Filter Algorithm," Proceedings of the IEEE Oceans Conference, Newport, RI, Vol. 1, 1992, pp. 120-125.

[7] J.L. McClelland and D.E. Rumelhart, Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises, MIT Press: Cambridge, MA, 1988.

[8] E. Wilson, S. Umesh, and D.W. Tufts, "Multistage Neural Network Structure for Transient Detection and Feature Extraction," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 1993, pp. 489-492.

[9] S.G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989, pp. 674-693.

[10] S. Mallat and S. Zhong, "Wavelet Transform Maxima and Multiscale Edges," in Wavelets and their Applications, ed. M.B. Ruskai et al., Jones and Bartlett: Boston, 1992, pp. 67-104.

[11] S. Mallat and S. Zhong, "Characterization of Signals from Multiscale Edges," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989, pp. 710-732.
