

AD-A244 148
RL-TR-91-262
Final Technical Report
October 1991

ADVANCED SIGNAL
PROCESSING TECHNIQUES

JENS Research

B. Andriamanalimanana, J. Novillo, S. Sengupta

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.

Rome Laboratory
Air Force Systems Command
Griffiss Air Force Base, NY 13441-5700

This report has been reviewed by the Rome Laboratory Public Affairs Office
(PA) and is releasable to the National Technical Information Service (NTIS). At NTIS
it will be releasable to the general public, including foreign nations.

RL-TR-91-262 has been reviewed and is approved for publication.

APPROVED:
TERRY W. OXFORD, Captain, USAF
Project Engineer

FOR THE COMMANDER:

GARRY BARRINGER


Technical Director
Intelligence & Reconnaissance Directorate

If your address has changed or if you wish to be removed from the Rome Laboratory
mailing list, or if the addressee is no longer employed by your organization, please
notify RL (IRAP), Griffiss AFB NY 13441-5700. This will assist us in maintaining a
current mailing list.

Do not return copies of this report unless contractual obligations or notices on a


specific document require that it be returned.
REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188

1. AGENCY USE ONLY (Leave Blank)    2. REPORT DATE: October 1991    3. REPORT TYPE AND DATES COVERED: Final

4. TITLE AND SUBTITLE: ADVANCED SIGNAL PROCESSING TECHNIQUES
5. FUNDING NUMBERS: C - 30TN--D-0026, Task 1-9-4265; PE - 31001G; PR - 3189; TA - 03; WU - PI

6. AUTHOR(S): B. Andriamanalimanana, J. Novillo, S. Sengupta

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): JENS Research, 139 Proctor Blvd, Utica NY 13501
8. PERFORMING ORGANIZATION REPORT NUMBER: N/A

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Rome Laboratory (IRAP), Griffiss AFB NY 13441-5700
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: RL-TR-91-262

11. SUPPLEMENTARY NOTES
Rome Laboratory Project Engineer: Terry W. Oxford, Captain, USAF/IRAP/(315) 330-458L
Submitted by JENS Research to Calspan-UB Research Center

12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution unlimited.
12b. DISTRIBUTION CODE

13. ABSTRACT (Maximum 200 words)

The research efforts under this program have been in the area of parallel and
connectionist implementations of signal processing functions such as filtering,
transforms, convolution and pattern classification and clustering. The topics
which have been found to be the most promising, in as far as their novelty
and applicability to the classification of pulses, are covered in the following
four sections:

1. Algebraic Transforms and Classifiers
2. An Interaction Model of Neural-Net based Associative Memories
3. Models of Adaptive Neural-Net based Pattern Classifiers
4. Discrete Fourier Transforms on Hypercubes

NOTE: Rome Laboratory/RL (formerly Rome Air Development Center/RADC)
14. SUBJECT TERMS: An Interaction Model of Neural-Net based Associative Memories,
Content Addressable Memories, Discrete Fourier Transforms
15. NUMBER OF PAGES
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED    18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED    19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED    20. LIMITATION OF ABSTRACT: UL
Advanced Signal Processing Techniques

Technical Summary

Our research efforts have been in the area of parallel and connectionist
implementations of signal processing functions such as filtering, transforms,
convolution and pattern classification and clustering.

Much more than is reported here was reviewed and entertained for further study
and potential implementation. For example, we considered in detail
implementation of techniques common to all filtering such as the computation of
recursive linear equations using systolic arrays and pipeline and vector processors.
Similarly, various neural architectures were studied and implemented. These
include Hopfield, Hamming, Back Propagation and Adaptive Resonance Theory
Networks.

We report here those topics which we found to be most promising in as far as their
novelty and applicability to the classification of pulses and additionally on a new,
more transparent method for the computation of the discrete Fourier transform on
hypercubes. The report consists of four sections:

Algebraic Transforms and Classifiers presents a two-layer neural net classifier with
the first layer mapping from feature space to a transform or code space and the

second layer performing the decoding to identify the required class label. This
research strongly suggests the further investigation of new types of neural net
classifiers whose kernels can be identified with linear and nonlinear algebraic
binary codes.

An Interaction Model of Neural-Net based Associative Memories presents a new


model for Content Addressable Memories which takes into account pattern
interactions generated dynamically over time. It is hoped that this proposed
scheme will yield low error rates.

Models of Adaptive Neural-net based Pattern Classifiers deals with design and
implementation issues at the operational level of ART-type parallel distributed
architectures in which learning is asynchronously mediated via a set of concurrent
group nodes comprising a number of elemental processors working on component
substrings of the input pattern strings. A single-slab Bland-net alternative
architecture is also proposed.

Discrete Fourier Transforms on Hypercubes presents a decomposition of the


transform matrix that is easier to understand than others found in the literature
and a basic scheme for power-of-prime length transform computations on the
hypercube. For lengths that are not a power of a prime, the use of the Chinese
Remainder Theorem in pipelining the computation is proposed and discussed.

ALGEBRAIC TRANSFORMS AND CLASSIFIERS

A two-layer neural net classifier is proposed and studied. Codewords from a {1,-1}
algebraic code are 'wired' in the net to correspond to the class labels of the classification
problem studied. The first layer maps from feature space to some hypercube {1,-1}^n,
while the second layer performs algebraic decoding in {1,-1}^n to identify the codeword
corresponding to the required class label.

The main motivation for introducing the proposed architecture is the existence of a large

body of knowledge on algebraic codes that has not been used much in classification

problems. It is expected that classifiers with low error rates can be built through the use of

algebraic coding and decoding.

1. Neural network classifiers

Artificial neural nets have become the subject of intense research for their potential
contribution to the area of pattern classification [3]. The goals of neural net classifier
construction are manifold and include the building of classifiers with lower error rates than
classical (i.e., Bayesian) classifiers, which are capable of good generalization from training
data, and having only reasonable memory and training time requirements. Most neural net
classifiers proposed in the literature adjust their parameters using some training data. In
this proposal, training has been set aside to bring out possible benefits of algebraic coding.
In subsequent work, classifier parameter adjustment will be studied in conjunction with the
algebraic coding approach.

2. Net topology and operation

The neural net consists of three layers of units connected in a feedforward manner.
The number of units in the input layer is the length k of each feature vector. The hidden
layer has as many units as the length of the algebraic code used. Finally, the output layer
contains one unit for each class in the classification problem. The chosen codewords from
the algebraic code used are wired into the net via the interconnection weights between the
input and the hidden layers. The decoding properties of the algebraic code come into play
in the interconnection weights between the hidden and the output layers. The actual choice
for these weights will be detailed below. Note that even though the topology chosen here
resembles other popular neural net topologies, such as that of the back-propagation
network, this first version of our classifier is non-adaptive and non-learning (the pattern
classes are not learned but wired in from some a priori knowledge), and the time needed to
recognize a pattern is only the propagation delay through the two layers of weights.

Figure 1
The network operates as follows. A feature vector f to be classified is presented to the
input layer; this feature vector is a corrupted version of some class representative l (the
latter associated with the codeword c). The second layer of units outputs a vector u, the
'frequency domain' transform of the feature vector f, which is a corrupted version of c. The
output layer decodes u into the codeword c, thereby identifying the class represented by l.
The output layer does this by having all units 'off' except the one corresponding to
codeword c, i.e., class l.

To make the net operate as required, we need to devise schemes for wiring codewords in
the interconnection weights so as to achieve a 'continuous' mapping from feature space to
code space (feature vectors that are close correspond to vectors in {1,-1}^n that are also
close; this is important if the error rate is to be kept low), decide the transformation
performed by each unit, and find a way of making the decoding of a corrupted codeword
correspond to turning on exactly one unit in the output layer. Finally, we must analyze the
kind of algebraic codes that are suitable for the proposed architecture.

First, note that the input layer of units will simply buffer the feature vector to be
classified; hence the units here perform the identity transformation. Secondly, it should be
clear that the units in the hidden and output layers must be polar (i.e., they output 1 or -1;
actually we could have made them binary but it is easier to deal with polar outputs for
neural nets). We have decided to make them threshold units, outputting +1 if the net
input is at least the threshold θ of the unit, -1 otherwise. The value of θ will be determined
below. But first we need to make some elementary observations about {1,-1} codes.

3. Remarks on {1,-1} codes

We remark that if c and c' are two {1,-1} vectors of the same length n, then

c · c' + 2 distance(c,c') = n

where the dot is the usual dot product and distance is the Hamming distance between the
two vectors. This is so because for {1,-1} vectors, c · c' is the number of coordinates where
the two vectors agree minus the number of coordinates where they disagree; hence c · c' is
n minus twice the Hamming distance of the two vectors.

A simple consequence of this observation is that if C is a {1,-1} code of length n, i.e., a
collection of distinct {1,-1} vectors all of length n, and if M is the maximum dot product
among different codewords of C, then the minimum Hamming distance among different
vectors of C is

0.5 * (n - M)

and thus the code corrects any vector corrupted by less than

0.25 * (n - M)

errors, using the usual nearest neighbor decoding scheme. Because of this, we are
looking for {1,-1} codes where the length n is much larger than the maximum dot product
M.
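
The short NumPy sketch below (ours, not part of the original report; the codewords are
illustrative) checks the dot-product/Hamming-distance identity and the resulting
error-correction bound numerically.

```python
# A minimal numerical check of the relations above; the {1,-1} vectors are illustrative.
import numpy as np

def hamming(c, cp):
    """Number of coordinates where two {1,-1} vectors disagree."""
    return int(np.sum(c != cp))

n = 8
rng = np.random.default_rng(0)
c  = rng.choice([1, -1], size=n)
cp = rng.choice([1, -1], size=n)

# c . c' + 2 * distance(c, c') = n
assert int(c @ cp) + 2 * hamming(c, cp) == n

# For a code C with maximum pairwise dot product M, the minimum Hamming
# distance is 0.5*(n - M) and fewer than 0.25*(n - M) errors are corrected.
C = np.array([[1,  1,  1,  1, 1,  1,  1,  1],
              [1, -1,  1, -1, 1, -1,  1, -1],
              [1,  1, -1, -1, 1,  1, -1, -1]])
dots = [int(C[i] @ C[j]) for i in range(len(C)) for j in range(len(C)) if i < j]
M = max(dots)
print("min distance:", 0.5 * (n - M), "; corrects fewer than", 0.25 * (n - M), "errors")
```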

Next, we remark that if c is the closest codeword in the code C to a {1,-1} vector u, and if
w is any other vector in C, then the following inequalities hold

w · u < 0.5 * (n + M) < c · u

This can be seen as follows. Let u ∈ {1,-1}^n and let c be the codeword in C closest to u
(note that at first sight there may be several codewords in C having minimum distance
from u; however if u has not been corrupted by 0.25 * (n - M) or more errors, then this is
impossible; we will assume this from now on).

Now let u = c + e, where the error vector e has its q nonzero components equal to 2 or
-2. We have

distance(u,w) > distance(u,c), for all w ∈ C - {c}

Now

c · u = n - 2 distance(c,u) = n - 2q

Since it is assumed that less than 0.25 * (n - M) errors have been made in corrupting c to
get u, we have

q < 0.25 * (n - M)

Hence

c · u = n - 2q > n - 0.5 * (n - M) = 0.5 * (n + M)

Also,

w · u = w · (c + e)
      = w · c + w · e
      ≤ M + 2q, because the q nonzero components of e are 2 and -2
      < M + 0.5 * (n - M) = 0.5 * (n + M)

Hence the claim is proved.

Note that the last observation will be used to decide the threshold for the units in the
output layer, and the interconnection weights between the hidden and output layers.

4. Weight matrix from hidden to output layers

Choose a suitable code C (suitable in the sense of having the length n much larger than
the maximum dot product M) and select m codewords

c_1, c_2, ..., c_m

from C. Writing the components of c_i as

c_i1, c_i2, ..., c_in

use the following matrix as the weight matrix from the hidden to output layers

    [ c_11  c_12  ...  c_1n ]
    [ c_21  c_22  ...  c_2n ]
    [  .     .          .   ]
    [ c_m1  c_m2  ...  c_mn ]

In other words, the weights from the hidden layer to the output layer are all 1 or -1, and
the weights from the n hidden units to the jth output unit are the components of the
codeword c_j (see Figure 1).

Now if the output of the hidden layer is the vector u, then the net input to the jth output
unit is

c_j · u

Hence if the threshold of each unit in the output layer is chosen to be

0.5 * (n + M)

then by the observation in section 3, only the unit corresponding to the codeword closest
to u will output 1, the rest -1.
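
As an illustration of this decoding stage, the sketch below (ours; the code and the
corrupted vector are illustrative) wires a small code into the hidden-to-output weights and
thresholds each output unit at 0.5 * (n + M).

```python
# Illustrative sketch of the output layer: weights are codeword components,
# each output unit thresholds its net input c_j . u at 0.5*(n + M).
import numpy as np

def decode(u, codewords, M):
    """Return the {1,-1} output vector: +1 only for the codeword closest to u."""
    n = codewords.shape[1]
    theta = 0.5 * (n + M)                # threshold of each output unit
    net = codewords @ u                  # net input c_j . u to output unit j
    return np.where(net >= theta, 1, -1)

# Example with a code of mutually orthogonal rows (M = 0) and length 8.
H = np.array([[ 1,  1,  1,  1,  1,  1,  1,  1],
              [ 1, -1,  1, -1,  1, -1,  1, -1],
              [ 1,  1, -1, -1,  1,  1, -1, -1],
              [ 1, -1, -1,  1,  1, -1, -1,  1]])
u = H[2].copy()
u[0] = -u[0]                             # corrupt one coordinate (< 0.25*n errors)
print(decode(u, H, M=0))                 # -> [-1 -1  1 -1]
```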

5. Weight matrix from input to hidden layers

The thresholds of the units in the hidden layer are chosen to be 0. Now let W_kxn be the
weight matrix between the input and the hidden layers. Let R_mxk be the matrix whose
rows are the class labels in the classification problem at hand.

    [ r_11  r_12  ...  r_1k ]
    [ r_21  r_22  ...  r_2k ]
    [  .     .          .   ]
    [ r_m1  r_m2  ...  r_mk ]

Several schemes can be imagined for computing the matrix W_kxn from the matrices R_mxk
and C_mxn.

The first scheme is a simple correlation, defined by

W_kxn = (R_mxk)^T C_mxn

where (R_mxk)^T denotes the transpose of the matrix R_mxk. In other words, the weight from
the ith input unit to the jth hidden unit is the sum of the correlations between the ith
component of a class label and the jth component of the corresponding codeword (see also
Figure 1).

W_ij = Σ_{l=1}^{m} r_li c_lj

The idea here is a natural one. In order to achieve continuity, we seek to strengthen the
connection between a class label and its corresponding codeword.

Another scheme 'simulates' a linear transformation from feature space to code space. If the
hypercube {1,-1}^n were a vector space, which it is not, and if the class labels, i.e., the rows
of the matrix R_mxk, were linearly independent, then there would be a unique linear
transformation mapping class label r_i onto codeword c_i, for all i with 1 ≤ i ≤ m. In other
words, there would be a unique kxn matrix W with

R_mxk . W = C_mxn

If R were square, it would be nonsingular and we could write

W = (R_mxk)^(-1) C_mxn

which would give the interconnection weights we are looking for (recall a linear
transformation between finite-dimensional vector spaces is always continuous).

Of course, the above assumptions do not hold in most cases. We may though use some
kind of approximation of the inverse for R, called the pseudoinverse of R and denoted R^-
(we have dropped the subscript mxk for simplicity).

If the matrix R is of maximum rank, but not square, then R . R^T is nonsingular and the
pseudoinverse R^- is defined by

R^- = R^T (R . R^T)^(-1)

If R . R^T is not invertible, then R^- is given by a limit

R^- = lim R^T (R . R^T + ε^2 I)^(-1),  as ε → 0

where I denotes the identity matrix of the size of R . R^T and ε is a positive real number.
It can be shown that the pseudoinverse R^- always exists. Of course, in practice, we only
compute an approximation of the limit most of the time, since the exact form of R^- may
be difficult to obtain. For a nonsingular matrix, the ordinary inverse and the
pseudoinverse are of course identical.

In any case, we can take as weight matrix from the input layer to the hidden layer the
product

W_kxn = R^- . C_mxn
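
Both schemes can be prototyped in a few lines; the sketch below (ours, with illustrative
label and codeword values) uses numpy.linalg.pinv as the pseudoinverse approximation.

```python
# Illustrative sketch of the two weight-construction schemes for the input-to-hidden
# matrix W: the simple correlation R^T C and the pseudoinverse-based scheme R^- C.
import numpy as np

R = np.array([[ 1.,  1., -1.],      # m x k matrix of class labels (rows)
              [-1.,  1.,  1.]])
C = np.array([[ 1.,  1.,  1.,  1.], # m x n matrix of chosen codewords (rows)
              [ 1., -1.,  1., -1.]])

W_corr = R.T @ C                 # correlation scheme: W_ij = sum_l r_li * c_lj
W_pinv = np.linalg.pinv(R) @ C   # pseudoinverse scheme: W = R^- C

# With the pseudoinverse scheme, each class label is mapped (before thresholding
# at 0) onto its codeword whenever the rows of R are linearly independent.
print(np.sign(R @ W_pinv))       # reproduces C
```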

6. Hadamard coding

It is believed that the error rate of the proposed classifier depends heavily on the number
of errors that the algebraic code wired into the net can correct, which, as we have seen, is
a linear function of (n - M), where n is the length of the code and M is the maximum dot
product. We then need to investigate {1,-1} codes for which n is significantly larger than
M.

The class of Hadamard codes seems to provide a reasonable starting point. Recall that
Hadamard matrices constitute a handy tool in many disciplines, including signal processing
(Hadamard matrices are the kernel of the so-called Hadamard-Walsh transform) and the
design of experiments (Hadamard matrices give rise to good symmetric designs). Even
though Hadamard matrices exist only for certain orders, we believe that
they are still quite useful for the classifier architecture proposed.

A Hadamard matrix of order n is a square matrix H of order n, whose entries are 1 or -1,
and such that

H . H^T = n I

where I is the identity matrix of order n.

The definition really says that the dot product of any two distinct rows of H is 0, hence if
we take as codewords all the rows of H, then M = 0, i.e., the code can tolerate any number
of errors less than 0.25n. Of course the threshold of each output unit of the classifier must
be set to 0.5n.

Examples of low-order Hadamard matrices are shown in Figure 2, where -1 is simply
written -.

n = 1:  H1 = [ 1 ]        n = 2:  H2 = [ 1  1 ]
                                       [ 1  - ]

n = 4:  H4 = [ 1 1 1 1 ]  n = 8:  H8 = [ 1 1 1 1 1 1 1 1 ]
             [ 1 - 1 - ]               [ 1 - 1 - 1 - 1 - ]
             [ 1 1 - - ]               [ 1 1 - - 1 1 - - ]
             [ 1 - - 1 ]               [ 1 - - 1 1 - - 1 ]
                                       [ 1 1 1 1 - - - - ]
                                       [ 1 - 1 - - 1 - 1 ]
                                       [ 1 1 - - - - 1 1 ]
                                       [ 1 - - 1 - 1 1 - ]

Figure 2

It can easily be shown that a Hadamard matrix of order n can exist only if n = 1, n = 2, or
n ≡ 0 mod 4. It has not yet been proved that a Hadamard matrix of order n exists whenever
n is a multiple of 4, even though this is widely believed.

For our purposes it is enough to note that Hadamard matrices do exist for orders that are
a power of 2. This is seen through the fact that if H_n is a Hadamard matrix of order n,
then H_2n is a Hadamard matrix of order 2n, where

H_2n = [ H_n   H_n ]
       [ H_n  -H_n ]
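
A small sketch of this doubling construction (ours, for illustration) is given below; it
builds the order-8 matrix from H_1 and verifies the defining property H . H^T = n I.

```python
# Sylvester (doubling) construction of Hadamard matrices of order 2^r.
import numpy as np

def hadamard(n):
    """Return a Hadamard matrix of order n; n must be a power of 2."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])   # H_2n = [[H_n, H_n], [H_n, -H_n]]
    return H

H8 = hadamard(8)
assert np.array_equal(H8 @ H8.T, 8 * np.eye(8, dtype=int))   # H . H^T = n I
# Using the rows as codewords gives M = 0: fewer than 0.25*n errors are tolerated,
# and the output-unit threshold is 0.5*n.
print("tolerates fewer than", 0.25 * 8, "errors; threshold =", 0.5 * 8)
```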

7. Simplex coding

Recall the construction of the finite field GF(p^k) with p^k elements and prime subfield
GF(p) = Z_p, where p is a prime number (p = 2 is the most important case for our
classifier). An irreducible polynomial h(x) with coefficients in GF(p) and of degree k is
chosen. The elements of GF(p^k) are taken to be all polynomials with coefficients in GF(p)
and of degree less than k. Addition and multiplication are ordinary addition and
multiplication of polynomials but performed modulo h(x). It can easily be proved that
indeed GF(p^k) satisfies all axioms of a field and that it contains p^k elements.

It is well-known that the multiplicative group (GF(p^k) - {0}, *) of GF(p^k) is cyclic of
order p^k - 1. If x, as an element of the field, is a generator of this cyclic group, then h(x) is
called a primitive polynomial over GF(p). Note that databases of primitive polynomials
over GF(p) are available for many values of p.

Let

h(x) = x^k + h_{k-1} x^{k-1} + h_{k-2} x^{k-2} + ... + h_1 x + h_0

be a primitive polynomial of degree k over GF(2). So h_i = 0 or 1 for all i and the
irreducibility of h(x) implies that h_0 = 1.

We define the pseudo-noise sequence generated by h(x) and given bits a_0, a_1, ..., a_{k-1} as
the binary sequence

a_0, a_1, ..., a_{k-1}, a_k, a_{k+1}, ...

where, for l ≥ k,

a_l = a_{l-k} + a_{l-k+1} h_1 + a_{l-k+2} h_2 + ... + a_{l-1} h_{k-1}

(addition and multiplication are over GF(2), i.e., modulo 2).

The following properties are well-known [4]:

(i) Let C = a_l a_{l+1} a_{l+2} ... a_{l+n-1} be a subsequence of length n of a pseudo-noise
sequence (a_i) generated by h(x), where

n = 2^k - 1

Then for any cyclic shift C' of C, C + C' is another cyclic shift of C.

(ii) If C and n are as in (i) then C has 2^{k-1} ones and 2^{k-1} - 1 zeros.

Now from a pseudo-noise sequence generated by h(x), a binary code of length n = 2^k - 1,
denoted C-, is constructed by taking as codewords the vector

a_0, a_1, ..., a_{2^k - 2}

and all of its 2^k - 2 cyclic shifts.

The code C- together with the all-zero vector of length 2^k - 1 is also known in the
literature as the cyclic simplex code of length 2^k - 1 and dimension k. By property (i)
above, the simplex code is a vector subspace of dimension k of GF(2)^n, where n = 2^k - 1.

From C- a {1,-1} code C is constructed, through the affine mapping a → 1 - 2a (i.e., replace
0 by 1 and 1 by -1).

For u, w ∈ C, we can compute u·w as follows.

If u = w then clearly u·w = n = 2^k - 1.

If u ≠ w then u·w = n - 2 distance(u,w) as explained in section 3. But distance(u,w) is the
distance between the corresponding vectors in C-, which by linearity of the simplex code
C- ∪ {0}, equals

distance(0, c_2 - c_1)
= distance(0, c_2 + c_1)
= weight(c_2 + c_1)
= 2^{k-1}, by property (ii) above.

So,

u·w = n - 2 distance(u,w)
    = 2^k - 1 - 2 · 2^{k-1} = -1

Hence for the code C, the maximum dot product is M = -1. C can tolerate any number
of errors less than 0.25 * (n+1) = 2^{k-2}. The threshold of each output unit of the classifier
must be set to 0.5 * (n+1) = 2^{k-1}.

We give a couple of examples to illustrate the construction.

For h(x) = x^3 + x + 1, and a_0 = 0, a_1 = 0 and a_2 = 1, we get

a_3 = a_0 + a_1 · h_1 + a_2 · h_2 = 0
a_4 = a_1 + a_2 · h_1 + a_3 · h_2 = 1
a_5 = a_2 + a_3 · h_1 + a_4 · h_2 = 1
a_6 = a_3 + a_4 · h_1 + a_5 · h_2 = 1

The codes C- and C are

0010111        1 1 - 1 - - -
1001011        - 1 1 - 1 - -
1100101        - - 1 1 - 1 -
1110010        - - - 1 1 - 1
0111001        1 - - - 1 1 -
1011100        - 1 - - - 1 1
0101110        1 - 1 - - - 1

for which we can see that the minimum distance is indeed 4.

14
Another example is provided by h(x) = x 4 + x + 1, a 0 = a1 = a2 = 0 and a 1. \c

compute

a4a5a6a7a8a9a
1 1 2 a1 3 a1 4
01a1 a = 00110101111

The code C isshown below, for which the {1,-1} code C iseasily obtained.

000100110101111

100010011010111
11000100110101i

111000100110101
111100010011010
011110001001101

101111000100110
010111100010011
101011110001001
110101111000100
011010111100010
001101011110001
100110101111000
010011010111100

001001101011110
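
The construction can be checked with the short sketch below (ours; the helper
simplex_code and its arguments are illustrative). It regenerates the k = 3 example and
verifies that the maximum dot product of the {1,-1} code is M = -1.

```python
# Generate the pseudo-noise sequence from the taps of a primitive polynomial h(x),
# take the length-(2^k - 1) window and all its cyclic shifts as the binary code C-,
# and map a -> 1 - 2a to obtain the {1,-1} code C.
import numpy as np

def simplex_code(h, a0):
    """h = [h_1, ..., h_{k-1}] taps (h_0 = 1 implicitly); a0 = the k initial bits."""
    k = len(a0)
    n = 2 ** k - 1
    a = list(a0)
    for l in range(k, n):
        # a_l = a_{l-k} + a_{l-k+1} h_1 + ... + a_{l-1} h_{k-1}   (mod 2)
        a.append((a[l - k] + sum(a[l - k + i] * h[i - 1] for i in range(1, k))) % 2)
    word = np.array(a)
    Cminus = np.array([np.roll(word, s) for s in range(n)])   # all cyclic shifts
    return Cminus, 1 - 2 * Cminus                             # binary and {1,-1} codes

Cminus, C = simplex_code(h=[1, 0], a0=[0, 0, 1])   # h(x) = x^3 + x + 1
dots = [int(C[i] @ C[j]) for i in range(7) for j in range(7) if i < j]
print(Cminus[0], "max dot product M =", max(dots))  # -> [0 0 1 0 1 1 1]  M = -1
```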

8. Future work

We need to analyze more classes of algebraic codes suitable for the proposed architecture.
One of the promising directions is the area of codes from block designs (as mentioned
earlier, the Hadamard codes actually fall under this area). Properties of these codes are
often distilled from the geometric properties of the designs. Work in this area is in
progress.

Next we should actually attempt an analytical study of the error rate of the proposed
classifier. Note that the mapping from feature space to code space may actually introduce
more errors. Even if this mapping does not introduce any error, it may still be difficult to
derive a closed form of the error rate, so bounds may be the best we can do.

Finally we need to design an adaptive version of the proposed architecture, where the code
is somehow wired in the net but an improvement of the estimates of the parameters of the
classifier is also accomplished through training data.

9. References

[1] Hopfield, J.J., Neural networks and physical systems with emergent collective
computational abilities, Proc. Nat'l. Acad. Sci. USA, v. 79, pp. 2554-2558, 1982.

[2] Lippmann, Richard P., An introduction to computing with neural nets, IEEE Acoustics,
Speech, and Signal Processing Magazine, v. 4, pp. 4-22, 1987.

[3] Lippmann, Richard P., Pattern classification using neural networks, IEEE
Communications Magazine, vol. 27, no. 11, November 1989.

[4] MacWilliams, F.J. and N.J.A. Sloane, The theory of error-correcting codes, 2 volumes,
North-Holland, 1977.

[5] Widrow, B. and R. Winter, Neural nets for adaptive filtering and adaptive pattern
recognition, Computer, v. 21, no. 3, 1988.
An Interaction Model of
Neural-Net based Associative Memories

1. Introduction

One of the most focused topics in the entire neural network paradigm that still continues
to engage a major research effort is evidently the topic of Associative or Content-
Addressable Memory (CAM) in the context of a connectionist framework (9). An Associ-
ative Memory (AM) or a Content-Addressable Memory is a memory unit that maps an
input pattern or a feature vector x_k ∈ R^n to an associated output vector y_k ∈ R^m in O(1)
time. The input pattern x_k may be a substring of the desired output pattern y_k, or may be
totally distinct; however, in both cases, we identify an association of the form (x_k, y_k) for
each pattern index k. An AM or CAM is a storage of all such association pairs over a
given pattern domain. The input string is also known as the cue-vector, or a stimulus; the
output string, the associated response pattern.

Viewing it from a neural network perspective, the ensemble, even if conceptualized as
a pattern classifier/recognizer in a strict functional sense, is essentially an AM/CAM entity
if it works correctly. Because ultimately a neural-net based classifier/recognizer logically
maps an external object via an appropriate feature space onto distinct classes, we can say
that in a functional context the pairs (object/input pattern (stimulus), pattern-class
(response)) comprise associative pairs. In this section we present a new model of Associative
Memories or Content-Addressable Memories realizable, say, via an appropriate optical
implementation of an artificial neural net type system using lenses, masks, gratings, etc.
(4,6). The issue addressed here is not this realization task per se, but the more important
issue of models basically congruent to neural net type systems from the point of view of
functionality and performance, models which reflect distributivity, robustness,
fault-tolerance, etc. The implementation issue has to be addressed eventually - one could
aim for a neural net system in an electronic (as opposed to optical) environment - but that
as a topic must wait till one clears out the model related issues, the overall architectural
framework of achieving a certain set of objectives at the topmost level of specification.
In this paper, a matrix model of Associative Memory is proposed. The strength of the
model lies in the way it is different from the traditional AM models articulated thus far
(see Hopfield (7,11), Kohonen (8,9), etc. for Associative Memory models). This model
becomes significant at high storage density of patterns, when the logically neighboring
patterns do tend to perturb each other considerably. A memory system even locally stable
in the sense of Lyapunov at or around specific logical sites may tend to display erraticity
upon receiving another pattern for storage if the pattern storage density increases (12,3).
In such cases, instead of trying to keep patterns away from interacting with each other,
one could deliberately allow them to interact and then consider all such collective interac-
tions in the overall dynamics of the system. To date, no such model has been proposed.

Our model is different from the proposed classical memory models in one significant
sense. It allows interactions among patterns to perturb the associated feature space. As we
pack dynamically more and more patterns into the system, the more skewed becomes the
'neighborhood' of a pattern. It is analogous to the situation where an electric field around
a charged element is perturbed by bringing in more charged particles in the vicinity. It is
this recognition in our model that provides a changed perspective.
We observe at the outset the two salient features associated with neural net type pro-
cessors or systems using such processors. First, a neural network model is ultimately a
content-addressable memory (CAM) or even an associative memory (AM) depending
upon how it is implemented. The central theme in such models is the storing of information
of the relevant patterns distributively, in a holistic fashion, over the synaptic weights between
pairs of neuron-like processor elements, allowing both storage and retrieval tasks more or
less in a fail-soft manner. Even though the time-constant of a biological neuron's switching
time is of the order of milliseconds, storage and recall of complex patterns are usually car-
ried out speedily in a parallel distributed processing mode. It is ultimately this notion of
distributed memory, in contrast to the local memory of conventional computers, that stands
out in so far as neural net type systems are concerned.

A neural network system could also be viewed at the same time as a pattern
classifier/recognizer. Problems like radar target classification and object identification from
received sonar signals are where neural net type systems are most adept. In particular, if a
recognition of a partially viewed object or a partially described object has to be made, and
if the data is further contaminated with white noise, then not only does a neural-based
classifier appear more promising for real-time classification from the point of view of
performance, but a CAM or an AM oriented system might just be ideal. It depends, of
course, on how we ultimately structure the problem domain, what type of cues are allowed
for appropriate responses, etc., but it is eminently possible to lean on a specific CAM/AM
oriented neural-net paradigm to do just that. In this section, a generalized outer product
model is proposed.

The associative-memory elements ought to yield a performance close to O(1) search
time. However, once pattern-pattern interactions are allowed within the model one cannot
guarantee O(1) performance. In our model, it is feasible that on some cue patterns the
problem is so straightforward that one need not consider pattern-pattern interactions at
all. In such cases, it behaves as a classical AM element with O(1) search-time.
Consider an orthonormal basis {φ_k} spanning an L-dimensional Euclidean vector
space in which associated pairs of patterns (cue, response) in the form (x_k, y_k) are stored
in a correlation-matrix format of association. The whole idea is to buffer the stored patterns
from the noisy cue vectors via this basis such that one could see the logical association
between x and y via a third entity, our basis. Note that a cue vector x_k in this basis is
expressed as

|x_k> = Σ_i a_ki |φ_i>

with its transpose as

<x_k| = Σ_i a_ki <φ_i|

such that the dot product between two such vectors |x_k> and |x_l> is

<x_k | x_l> = Σ_ij a_ki a_lj δ_ij = Σ_i a_ki a_li = a_kl

The Euclidean norm of a vector is then

<x_k | x_k> = Σ_i (a_ki)^2 = a_kk

Note that the cue-vectors themselves, in this model, need not constitute an orthogonal vec-
tor space. Even though this is a desirable assumption to consider, in reality, it may not be
a good assumption to depend on. Secondly, the vector space {φ} need not be complete; an
appropriate subspace would do, as indicated by Oja and Kohonen (13). They build the pat-
tern space over some p-dimensional subspace with p ≤ L and deal with a pattern by hav-
ing its projection on this subspace and extrapolating its residual as an orthogonal projec-
tion on this space. We could also extend our model in this way.

We could interpret the coefficients a_ki associated with a cue-feature as follows. The
normalized coefficient a_ki / (Σ_j a_kj^2)^(1/2) could be regarded in this model as the
probability amplitude that the vector |x_k> has a nonzero projection on the elemental basis
vector |φ_i>. The squares of such terms are the associated probabilities; if we want to picture
it, a cue vector in this representation would be a sequence of spikes on the φ-domain whose
height at any specific φ-point is the probability amplitude that the vector has a nonzero
projection there. Note that the sum of the squares of all these projections must necessarily
equal 1.0 for a nonzero cue-vector.
Given (cue, response) pairs we form matrix associative memory operators as follows:

M_s = Σ_k |φ_k><x_k|

M_r = Σ_k |y_k><φ_k|

The operator M_s on the cue |x_l> is

M_s |x_l> = Σ_k |φ_k> <x_k | x_l>

          = |φ_l> a_ll + Σ_{k≠l} a_kl |φ_k>

such that its projection on any arbitrary |φ_q> is

<φ_q | M_s | x_l> = a_ll δ_ql + a_ql

We call this the figure of merit of a cue x_l on a basis vector φ_q, fm(x_l | φ_q).
Obviously, the memory operator has the largest projection when q = l, i.e.

index { max_q { fm(x_l | φ_q) } } = l

Note that if max_s { fm(x_l | φ_s), s ≠ q } = a_ll, then the cue x_l may not be discernible
from another cue with which an altogether different response is associated. We denote the
associated domain of certainty by a resolution measure res(y_s), where

res(y_s) = res(y_l | s)

         = 1 - ( max_{s≠q} [ fm(x_l | φ_s) ] ) / a_ll

In the simplest case, one assumes res(.) on associated pattern vectors to be very large, or,
equivalently, one assumes that the storage of multiple (cue, response) pairs is facilitated by
the requirement that the Hamming distance between any two such cues in the
library is relatively large. In such cases, after receiving the index value l associated with
the given cue |x_l>, one continues by operating M_r on |φ_l>. Then

M_r |φ_l> = Σ_k |y_k> <φ_k | φ_l> = |y_l>

indicating successful retrieval of the associated pattern |y_l>.
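
As an illustration of the retrieval mechanics just described, the sketch below (ours; the
orthonormal basis is taken as the standard basis and the cue/response vectors are
illustrative) builds the two operators as outer products and recovers the response from a
noisy cue via the figure of merit.

```python
# Illustrative one-point memory: M_s = sum_k |phi_k><x_k| projects a cue onto the
# basis, M_r = sum_k |y_k><phi_k| maps the winning basis vector onto the response.
import numpy as np

L = 4
phi = np.eye(L)                               # orthonormal basis {|phi_k>}
X = np.array([[1., 0., 1., 0.],               # stored cue vectors x_k (rows)
              [0., 1., 0., 1.]])
Y = np.array([[1., 0.],                       # associated responses y_k (rows)
              [0., 1.]])

Ms = sum(np.outer(phi[k], X[k]) for k in range(len(X)))   # |phi_k><x_k|
Mr = sum(np.outer(Y[k], phi[k]) for k in range(len(X)))   # |y_k><phi_k|

x = np.array([1., 0., 0.9, 0.1])              # noisy version of x_0
fm = Ms @ x                                   # figure of merit on each |phi_q>
q = int(np.argmax(fm))                        # winning basis index
print("retrieved response:", Mr @ phi[q])     # -> y_0
```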


Note that, in general, a received stimulus may not appear in the form one desires.
It could be a noise-contaminated incomplete cue-function of the form

|x> = γ |x_l> + |δ>,  where <x_l | δ> = 0

and 0 < γ < 1. Then a memory operation on x is

M_s |x> = γ M_s |x_l> + M_s |δ>

        = γ a_ll |φ_l> + γ Σ_{k≠l} a_kl |φ_k> + Σ_k |φ_k> <x_k | δ>

Now, given |δ> = Σ_i β_i |φ_i>, we obtain

<x_k | δ> = Σ_i a_ki β_i

<φ_m | M_s | x> = γ ( a_ll δ_ml + a_ml ) + Σ_i a_mi β_i

The matrix elements show deterioration of performance. Since the cue-vectors need not, in
general, form an orthonormal basis, the first term on the right hand side of the matrix ele-
ment would, in general, contain the usual cross-talk component; but now, in addition to
this, the distortion δ, even though orthogonal to the most likely cue-vector x_l, would intro-
duce additional degradation.
The question of optimization of the memory space is now the issue. Given the possi-
bility of receiving distorted signals at the input ports whence one obtains the cue-vector in
question, one may approach the design problem in a number of ways.

One simple approach, particularly when storage density is sparse, is to consider the
possibility that the received cue x is close to one of the cues considered acceptable. Under
this assumption, we form both the on- and the off-diagonal terms <φ_q | M_s | x> and note
the index q for which the matrix element is largest. We then logically associate x with y_q,
on the assumption that y_q is what is stored logically at the basis φ_q. The task of retrieving
the associated pattern given a cue is then a straightforward problem as long as a relatively
sparse set of (cue, response) pairs is associatively stored. The problem occurs when
cues are close to each other or when one finds that a single-level discriminant may not suf-
fice. If an unknown cue x is not stored precisely in the format in which it has been received,
and it is equally close to x_q as it is to x_r, with q ≠ r, we must have further information to
break the impasse.
One simple strategy is to extend the proposed abstract AM model as follows. In this
case, we do not store just single-instance (cue, response) items but, instead, a class of
items. In other words, we prepare a class-associative storage (or a class-content address-
able storage) in the form below

C_k = { (x_1k, y_k), (x_2k, y_k), ..., (x_jk, y_k), ... }

M_cs = Σ_{ik} |φ_k><x_ik|,  and

M_cr = Σ_k |y_k><φ_k|

Though the essential structure of the tuple (cue_1, cue_2, ..., response) is simple enough, it
could be exploited in a number of ways allowing us to extend the scope of the simple
CAM/AM based storage/retrieval task either through a spatial or a temporal extension, or
a combination of both.

In the sequel, we conjecture some simple schemes.

A. Spatial Extension

1. Concurrently, obtain n input signals {x_i, 1 ≤ i ≤ n}. These are n different
instances of the same measure, or different measures of the same feature element. The
underlying hypothesis is that all these x_i measures point to essentially the same
response vector y.

2. Obtain, in parallel, (x_i, y_i).

3. At the next cycle, at the next higher level, obtain η(y_i) for every occurrence of y_i;
η(.) yields the number of occurrences of y_i.

4. Obtain y* = max_{y_i} { η(y_i(x_i)) } where

y_i(x_i) = max_q fm(x_i | φ_q)

5. Output (y* | x_1, x_2, ..., x_n)

In this case a concurrent evaluation is proposed after the input stage. At the next layer, a
decision logic gate at the output port yields the optimum desired response.
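
A minimal sketch of this decision stage is given below (ours; the recall function is a
stand-in for the AM retrieval described earlier, and majority voting over the returned
responses approximates step 4).

```python
# Illustrative spatial extension: n concurrent AM modules each map their input
# instance x_i to a response y_i; a decision stage outputs the most frequent response.
from collections import Counter

def spatial_extension(instances, recall):
    """instances: list of cue instances x_i; recall: function x -> response label."""
    votes = Counter(recall(x) for x in instances)   # eta(y_i) occurrence counts
    y_star, _ = votes.most_common(1)[0]             # decision-logic stage
    return y_star

# Toy recall function; the scheme tolerates one corrupted instance.
recall = lambda x: "class-A" if sum(x) > 0 else "class-B"
print(spatial_extension([[1, 1], [1, -1, 1], [-1, -1]], recall))   # -> class-A
```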

Obviously, this simple design is robust, distributed and modular. It also provides mul-
tiple modular redundancy. In the next proposed extension, we introduce a model which is
essentially extensive in the temporal domain.

B. Temporal Extension

1. i ← 1
2. At cycle i, obtain (x_i, y_i)
3. While res(y_i) < δ do

3.1 i ← i + 1
3.2 obtain at the ith cycle
    ((x_i, y_i) | res(y_{i-1}) < δ, res(y_{i-2}) < δ, ..., res(y_1) < δ)
3.3 compute res(y_i)

4. Output (y_i | (x_{i-1}, y_{i-1}), ..., (x_1, y_1))

Here the crucial element is the retrieval process specified in 3.2. At the ith cycle, the
scheduled retrieval process is in a WAIT state until it is fired by the appropriate event at
the (i-1)th cycle, namely the arrival of the condition res(y_{i-1}) < δ. The two schemes are
outlined below.

[Figure: block diagrams of the spatial and temporal extension schemes]
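
A minimal sketch of the temporal-extension control loop is given below (ours; the recall
and resolution functions are illustrative stand-ins for the memory operations and the
res(.) measure).

```python
# Illustrative temporal extension: retrieval at cycle i is conditioned on the history
# accumulated so far, and the loop stops once the response is resolved above delta.
def temporal_extension(next_cue, recall, resolution, delta, max_cycles=10):
    history = []
    x = next_cue()
    y = recall(x, history)                      # cycle 1
    i = 1
    while resolution(y) < delta and i < max_cycles:
        history.append((x, y))                  # condition that fires cycle i+1
        x = next_cue()
        y = recall(x, history)                  # retrieval conditioned on history
        i += 1
    return y, history

# Toy example: resolution improves as more (cue, response) evidence accumulates.
cues = iter(range(100))
y, hist = temporal_extension(next_cue=lambda: next(cues),
                             recall=lambda x, h: ("y*", len(h) + 1),
                             resolution=lambda y: 0.3 * y[1],
                             delta=0.75)
print(y, "after", len(hist) + 1, "cycles")
```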

C. Multipoint Memory Models

The next level of extension is conjectured on a different basis. In this case, we form
memories based on the pair-wise product form of the original basis vectors |φ_k>, |φ_l> as
shown below. We call this a two-point memory system, a special case of the more general
multipoint memory system which we propose shortly.

M_2 = Σ_{kl} ( α_kl Φ_kl - β_kl Ψ_kl )

where,

Φ_kl = |φ_k(1) φ_l(2)> <x_k(1) x_l(2)|

Ψ_kl = |φ_l(1) φ_k(2)> <x_k(1) x_l(2)|

In this scheme, one argues that

• Logically, the cue-vector |x_k> is associated with the basis vector |φ_k> and |x_l>
with |φ_l>, with a joint probability α_kl.

• However, the closer the cue vectors |x_k> and |x_l> are, the more likely is the error one
is apt to make in this association. In view of probable misclassification, the weight on
the initial proposition must therefore be reduced. This we do by considering the
counter proposal that one could have |x_k> associated with |φ_l> and |x_l> with |φ_k>,
respectively, with a probability β_kl, instead. The higher our conviction in this
regard (higher β_kl values), the lower becomes the strength of our original hypothesis.

In this model, the matrix elements assume the forms

<φ_m(1) φ_n(2) | M_2 | x_p(1) x_q(2)> = α_mn a_mp a_nq - β_mn a_np a_mq

Given an arbitrary cue vector x, let us assume that res(y | x) < δ, and that within a
threshold limit the cue-vector x is similar to other cue-vectors whose logical associations
are, however, different. That is, we suppose an equivalence class associated with x

C = { (x_1, y_1 | x ≈ x_1), (x_2, y_2 | x ≈ x_2), ..., (x_n, y_n | x ≈ x_n) }

Given this, we would like to know how best one infers the corresponding associated vector
y with x. Let us denote the matrix element

<φ_m(1) φ_n(2) | M_2 | x(1) x_q(2)> = (m, n, *, q)

With x ≈ x_q, and x_q logically a priori associated with some y_n via the function φ_n, and
the assertion that x is associated to some y_m via the function φ_m, the matrix element
(m, n, *, q) becomes a measure of the strength of this hypothesis. Therefore, to obtain the
optimum association, we compute

max_{m,n,q} (m, n, *, q)

and obtain the corresponding φ_m, whence we obtain the most logical association y via

M_r |φ_m> = y

One could extend this notion of logical substitutability and reorganize the association space
accordingly.
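
Under the reconstruction of the two-point matrix element given above, the inference step
can be sketched as follows (ours; the arrays A, a_x, alpha and beta, and all their values,
are illustrative assumptions, not quantities from the report).

```python
# Illustrative two-point inference: score(m, n, q) implements the matrix element
# (m, n, *, q) = alpha[m,n]*<x_m|x>*<x_n|x_q> - beta[m,n]*<x_n|x>*<x_m|x_q>,
# and the winning m indexes the response to retrieve via M_r.
import numpy as np

rng = np.random.default_rng(1)
K = 3                                       # number of stored (cue, response) pairs
A = np.eye(K) + 0.1 * rng.random((K, K))    # A[k, l] = <x_k | x_l> (stored-cue overlaps)
a_x = np.array([0.9, 0.85, 0.1])            # a_x[k] = <x_k | x>   (overlaps with unknown cue)
alpha = np.ones((K, K))                     # strength of the direct association
beta = 0.3 * np.ones((K, K))                # strength of the swapped (counter) association

score = np.zeros((K, K, K))
for m in range(K):
    for n in range(K):
        for q in range(K):
            score[m, n, q] = alpha[m, n] * a_x[m] * A[n, q] - beta[m, n] * a_x[n] * A[m, q]

m_star = np.unravel_index(np.argmax(score), score.shape)[0]
print("retrieve the response associated with phi_%d" % m_star)
```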

This is carried out next in our three-point memory system model as indicated below.

M_3 = Σ_{klm} Σ_P P { (-1)^p α_klm |φ_k(1) φ_l(2) φ_m(3)> } <x_k(1) x_l(2) x_m(3)|

where P is a permutation operator with a parity p,

P = ( 1 2 3 )
    ( k l m )

Note that the right hand side is summed over all permutations. The matrix elements are
then

<φ_r(1) φ_s(2) φ_t(3) | M_3 | x_k(1) x_l(2) x_m(3)> = Σ_P P { (-1)^p α_rst a_rk a_sl a_tm }

which then works out to

(r, s, t, k, l, m) = α_rst a_rk a_sl a_tm - α_srt a_sk a_rl a_tm - α_rts a_rk a_tl a_sm
                   - α_tsr a_tk a_sl a_rm + α_str a_sk a_tl a_rm + α_trs a_tk a_rl a_sm

where (r, s, t, k, l, m) is the matrix element in question. Note that the substitutability
rationale could be advanced in the same way as we articulated it earlier in the two-point
memory system. If the string x_k is found similar to x_l and x_m, then the strength of the
hypothesis concerned with the a priori association of x_k with φ_k ought to be adjusted in
view of the likelihood of false memory assignments via φ_k, φ_l, and φ_m. In our model, we
propose that, with respect to the original configuration φ_k(1) φ_l(2) φ_m(3), every other
assignment of the form φ_k1(1) φ_k2(2) φ_k3(3), where k_1, k_2, k_3 is some permutation of
the indices k, l, m, is either an inhibitory memory association (reduces the strength of our
initial hypothesis) or a contributory memory association (advances the strength of our
initial hypothesis) depending on the number of transpositions required to transform {k,l,m}
into {k_1, k_2, k_3}. That is, if there are q such transpositions required to bring the string
{k,l,m} into {k_1, k_2, k_3}, then the parity of the new memory assignment is (-1)^q, as
though every single transposition is tantamount to an adjustment contrary to the direction
of the original assignment.
We consider the following diagram to illustrate the point that our model suggests.

[Figure: patterns P, Q, R, S, T stored at consecutive local minima of an energy 'hill']

Assume that on the energy 'hill', at the consecutive local minima, the distinct pat-
terns are stored as shown by the points P, Q, R, etc. Suppose, also, that these patterns
are so close to each other that sometimes instead of retrieving the pattern Q from the
storage we recall some pattern R. It is pointless to suggest that we sharpen our pattern
specification more tightly with a view to minimizing pattern misclassification to some
tolerable level. Accordingly, we approach the problem in the following way. Indeed, on
some cue, let us assume, it appears that the most probable recall is, say, pattern Q. How-
ever, what if we had the patterns stored in the order, say, R Q P S T ... or P Q R T S ...
instead of the indicated order of storage on the hill P Q R S T ...? Would that make any
difference though? Would we still return the pattern R when we were supposed to return
Q, instead? Surely, these alternative orders are also probable with some nonzero probabili-
ties, simply because the patterns (R and P) in the first suggested list, and the patterns (T
and S) in the second list, are conceivably interchangeable to the extent these few similar
patterns are concerned. But, then, why restrict ourselves to single-transposition lists?
Surely, if P Q R S T ... is a feasible list, then the list obtained by double transpositions
such as (P R)(P T) { P Q R S T ... } = { T Q P S R ... } is also probable. However, we
make a somewhat qualified claim now. If the storage order in the first approximation on
the 'hill' seems to be the list P Q R S T ... coming down the hill, then a storage order
implying a p-transposition on the original order ought to be less probable than one
with a q-transposition, if q < p. In other words, in our scheme of approach, the one-point
memory model is the most optimum first-order AM/CAM model we could think of. Its
improvement implies a model in which interactions among patterns cannot be ignored any
more. The two-point memory system, and then the three-point memory system, are then
the second and the third-order approximations, as indicated, with more and more pattern-
interactions taken into account.
Therefore, one could suggest a plausible model in which we carry out storage and
retrieval of patterns as follows. We assume that AM/CAM memory need not always
remain static. We, instead, suggest that such a memory should be dynamically restruc-
tured based on how successful one continues to be with the process of recall. We assume
the memory system to be in state γ, that is, its current memory is so densely packed that
at time t it is a γ-point memory at some level p (p is an integer less than or equal to γ) as
indicated below

M_p^γ = Σ_{klm...q} Σ_P P { (-1)^p α_klm...q |φ_k(1) φ_l(2) φ_m(3) ... φ_q(n_γ)> }
        <x_k(1) x_l(2) x_m(3) ... x_q(n_γ)|

If recall of similar patterns, in this memory, gives us a hit ratio less than some thres-
hold parameter ρ, then we should restructure the memory as an M_{p+1}^γ element, and
continue using it. This, probably, is what does tend to happen in using biological memory.
Whenever we tend to face difficulties in recalling stored information (with a presumed
similarity with other information strings), perhaps even consciously at times, we restruc-
ture our memory elements focusing on other nontrivial features of the patterns not con-
sidered before.

Accordingly, one could approach the storage/retrieval policy as follows. Assume that
the system is at state γ, i.e., its memory is at most M_γ^γ. Given an incoming pattern x, the
system tries to return a vector y using the simplest memory M_1^γ in which interactions
among the stored patterns are ignored. If, within some threshold parameter δ^2, it finds
that it could, instead, return any one of the vectors in the set S^2 = { z_1, ..., z_r } then it
considers its memory to be logically the M_2^γ and attempts to return y with less uncer-
tainty. If, however, it finds that, within some threshold parameter δ^3, it could, instead,
return any one of the vectors in the set S^3 = { z_1, z_2, ..., z_r } then it considers its memory
organized as M_3^γ and continues with the process. We assume that, while the system is at
state γ, the memory could be allowed to evolve sequentially at most to its highest level
M_γ^γ.

What if the system fails to resolve an input cue vector even at the memory level M_γ^γ?
Then, we could reject that pattern as irresolvable and continue with a fresh input. The
system could be designed to migrate from the state γ to the state γ+1 if the number of
rejects in the irresolvable class mounts up rapidly. Otherwise, we leave it at the state γ,
and on each fresh cue x we let the system work through M_1^γ → M_2^γ → ... → M_γ^γ as
need be.

References

1. Abu-Mostafa Y. S. and Jacques J. S.: 'Information Capacity of the Hopfield
Model', IEEE Trans. Information Theory, IT-31, pp 461-464, July 1985.

2. Bak C. S. and Little M. J.: 'Memory Capacity of Artificial Neural Networks
with High Order Node Connections', Proc. IEEE Intl. Conf. Neural Networks, I,
pp 1207-1216, July 1988.

3. Chiueh T. and Goodman R. M.: 'High Capacity Exponential Associative
Memories', Proc. IEEE Intl. Conf. Neural Networks, I, pp 1153-1160, July 1988.

4. Farhat N. H. et al.: 'Optical Implementation of the Hopfield Model', Applied
Optics, 24, pp 1469-1475, May 1985.

5. Fuchs A. and Haken H.: 'Pattern Recognition and Associative Memory -
Dynamical Processes in Nonlinear Systems', Proc. IEEE Intl. Conf. Neural Net-
works, I, pp 1217-1224, July 1988.

6. Hinton G. E. and Anderson J. A.: Parallel Models of Associative Memory,
Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1981.

7. Hopfield J. J.: 'Neural Networks and Physical Systems with Emergent Collec-
tive Computational Abilities', Proc. Natl. Acad. Sci. USA, 79, pp 2554-2558,
April 1982.

8. Kohonen T.: 'Correlation Matrix Memories', IEEE Trans. Computers, C-21,
pp 353-359, April 1972.

9. Kohonen T.: Associative Memory, Springer-Verlag, New York, 1977.

10. Lee Y. C. et al.: 'Machine Learning using a Higher Order Correlation Net-
work', Physica, 22D, pp 276-306, 1986.

11. McEliece R. J. et al.: 'The Capacity of the Hopfield Associative Memory',
IEEE Trans. Information Theory, IT-33, pp 461-482, July 1987.

12. Sussman H. J.: 'On the Number of Memories that can be Perfectly Stored in
a Neural Net with Hebb Weights', IEEE Trans. Information Theory, IT-35,
pp 174-178, January 1989.

13. Oja E. and Kohonen T.: 'The Subspace Learning Algorithm as a Formalism
for Pattern Recognition and Neural Networks', Proc. IEEE Intl. Conf. Neural
Networks, I, pp 1277-1284, July 1988.
Models of Adaptive Neural-net based
Pattern Classifiers/Recognizers

A neural network based system, implemented as a collection of elemental processors
working cooperatively in a functional setting in specific problem domains like controlling an
unknown dynamical system autonomously, is ultimately a learning entity that evolves
through some form of self-organization. Learning is mediated via recurrent sampling on an
open event domain based on the paradigm of 'learning by examples', whereby the pair-
wise connections among the 'neurons', the synaptic weights, asymptotically tend to
approach a globally stable equilibrium state on the assumption that the relevant training
set itself remains stable during the learning phase. In this framework, an ideal neural net-
work type system is functionally equivalent to an adaptive pattern recognizer, whether it is
designed to function as a front-end image-compressor system via an appropriate vector
quantization of images, or as a system that yields optimum or almost optimum tours of
Traveling Salesman problems. It is in this regard that neural network models have
been most extensively studied - as adaptive pattern classifiers/recognizers, as systems that
could autonomously learn and function in an unknown problem environment.

Accordingly, it is this specific area of clustering and pattern recognition in which the
major research thrust continues today (4) in one form or another. The problem here is pri-
marily two fold: (a) identification of an appropriate learning (or mapping) algorithm, and
(b), given (a), identification of an implementable architecture that reflects a functionally
parallel distributed system and support of such a system functionally at a high-
performance level. A neural network type system emerges as a powerful high-performance
machine because it is inherently concurrent at instruction and at task levels; any
compromise with this specific feature would surely lead to a precipitous drop in its perfor-
mance. Given this imperative one could alternatively approach the problem from the oppo-
site angle: obtain an appropriate learning algorithm given that the processing architecture
must be parallel-distributed with a maximum exploitation of system level concurrency.

The question of validity of neural network models is no longer the issue. The per-
tinent issue now is how to implement what needs to be implemented, i.e., the realization of
a feasible and promising learning algorithm, and the attendant system level architecture. In
this paper, the framework is set on the assumption that learning categories of patterns by
an unsupervised collection of automata in a distributed computation mode is where the
research interest is focused. Accordingly, this is the area we concentrate on in this paper.

Patterns over Feature space

Objects of interest, through appropriate feature extraction and encoding, are con-
veniently mapped onto patterns with binary-valued or analog features. Patterns
over feature vectors could be made associative with the corresponding class, which con-
stitutes the notion of equivalence classes in the following sense:

• A pattern x_i = (x_1i, x_2i, ..., x_vi)^T
is in class c_A if the corresponding class-exemplar x̄_A is
near to x_i within a tolerance distance (or an uncertainty) of δ_A, i.e.,

d(x_i, x̄_A) < δ_A

where,

A = index ( min_λ { d(x_i, x̄_λ) } )

where d(.,.) is some suitable distance measure on the metric space spanned by the pattern
vectors.

In an unsupervised environment, the emerging class-exemplars or the cluster proto-
types in a multiclass domain could be made to obey some additional constraints. Assuming
that we do not a priori know their distributions except for the requirement that their
mutual separation must be at least as large as some threshold parameter, one could require
that

d(x̄_A, x̄_B) ≥ ρ_min,  for A ≠ B
The idea is that at some distance or lower a pair of clusters may lose their individual dis-
tinctiveness, so much so that one could merge the two to form a single cluster. Precisely
how ρ_min ought to be stated is a debatable issue, but it could be made related to the
inter-cluster distance (ICD) distribution, which we may conjecture to be a normal distribu-
tion with a mean μ and a variance σ^2. Accordingly, we could let

ρ_min = μ + p σ

where p is some number around 1.0, and
σ is the standard deviation of the ICD-distribution N(μ, σ^2), assumed to be small. The
problem is that we usually do not know this distribution a priori. For an unknown pattern
domain, it may not even be possible to predict a priori into how many distinct pattern
classes the pattern space should be partitioned. All we know is that the more separated the
equivalence classes are, the more accurate one would be in ascertaining that a specific
pattern belongs to a specific class.

Ideally, clusters ought to be well separated on the feature plane where the patterns
are recognized. This corresponds to a high value of μ. This could be achieved by designing a
feature space in which small differences at individual pattern levels could be captured. This
is a design issue, an encoding problem which could be separately tackled to an arbitrary
degree of performance. This issue would be addressed in our subsequent papers. However,
there is another problem which we should look into. This arises in situations where the
order of input matters. In a functional system, we do want the pattern-classes to be compact,
i.e., we want the intra-cluster distance measures to be as small as possible so that we could
have crisp clusters and retain them as such. This means that the variance measures on the
exemplars, or the average coefficient of variation of an exemplar, ought to be small. But
then, is this, or should it be, ever under our control? The pattern-sampling that takes
place during the training period may include event instances so noisy and incomplete that
for all practical purposes they could be considered garbage. Should we let the system be
trained with such feature vectors? One could suggest that if on inclusion of such a feature
vector the class prototype is perturbed more than one could tolerate then one could aban-
don the vector altogether. However, this means that to be reasonable one should make the
level of tolerance a function of the number of patterns already accommodated within the class.
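
One simple way to realize this suggestion is sketched below (ours; the particular tolerance
schedule delta0 / (1 + 0.1 * count) and the exemplar update rule are assumptions, not the
report's).

```python
# Illustrative nearest-exemplar assignment with a per-class tolerance that tightens
# as the class accumulates more patterns; vectors that perturb the prototype too much
# are abandoned, as suggested in the text.
import numpy as np

def assign(x, exemplars, counts, delta0=1.0):
    """exemplars: list of class prototypes; counts: patterns already in each class."""
    d = [np.linalg.norm(x - e) for e in exemplars]
    A = int(np.argmin(d))                       # A = index(min_A d(x_i, xbar_A))
    delta_A = delta0 / (1.0 + 0.1 * counts[A])  # tighter tolerance for mature classes
    if d[A] < delta_A:                          # accept: update the class exemplar
        exemplars[A] = (counts[A] * exemplars[A] + x) / (counts[A] + 1)
        counts[A] += 1
        return A
    return None                                 # too noisy/far: abandon the vector

exemplars = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
counts = [3, 3]
print(assign(np.array([0.2, -0.1]), exemplars, counts))   # -> 0
```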

Some Neural Net Classifier Models

In this section, we conjecture some models which are obvious extrapolations of some
neural-net based models already advocated. One could consider a known classifier model
and expand it to suit a specific architecture, in which case one starts with the advantage
that at least the results up to the pre-extrapolated modeling state are known and could be
considered as a reference basis for further studies. However, we have some problems in this
regard. It is possible that

• without an extensive simulation and actual real case studies, one may not be in
a position to confidently use these extrapolated models, even though they do
appear fairly plausible at this stage,

• no neural net model has been so extensively studied as to understand the extent
to which the model is applicable under realistic noise and measurement oriented
issues,

• no research result is available to indicate to what degree the
classification/recognition problem, in the domain of the neural net paradigm, is sensi-
tive to pattern symmetries, particularly when one deliberately expands the feature
space by systematic topological operations such as folding, contraction, etc. Note
that such topological operations do not, in general, reduce the inherent entropy
measures of the patterns studied, but they could provide some redundancy on the
metric space which could be exploited. It is not necessarily obvious that such a
topological restructuring is always possible or even desirable for feature enhance-
ment.

Accordingly, we propose the following. Of course, this has to be thoroughly tested,
particularly with highly symmetric patterns and with those contaminated with noise, but
we could extend the feature space from an orthogonal v-dimensional feature space to an L-
dimensional general feature space contemplated as follows. Given a pattern on the original
vector space as a tuple {xi}, we consider it equivalent to the nonorthogonal expansion of
the following form

Original pattern P = {xi}

Equivalent pattern on the extended space, Pequiv = { xi ; xi xj, i ≠ j ; xi xj xk, i ≠ j ≠ k ; ... }

For classification/recognition purposes, we assume Pequiv = P. The additional redundancy
introduced by this artificial encoding would not be detrimental; on the contrary, it would
assist us to discriminate patterns successfully in a parallel-distributed processing scheme.
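As an illustration of this nonorthogonal expansion, the short sketch below is our own, not part of the report; the function name, the NumPy usage, and the cutoff on the product order are illustrative assumptions. It builds the extended feature vector by appending the pairwise and triple-wise products of distinct components.

from itertools import combinations
import numpy as np

def extend_pattern(x, max_order=3):
    """Nonorthogonal expansion of a pattern {x_i}: append the products
    x_i x_j (i != j), x_i x_j x_k (i != j != k), ... up to max_order,
    yielding the extended feature vector used in place of the original."""
    x = np.asarray(x, dtype=float)
    features = list(x)                               # the original components x_i
    for order in range(2, max_order + 1):
        for idx in combinations(range(len(x)), order):
            features.append(np.prod(x[list(idx)]))   # product over distinct indices
    return np.array(features)

# Example: a 4-dimensional pattern expands to 4 + C(4,2) + C(4,3) = 14 features.
print(extend_pattern([1.0, 0.5, -1.0, 2.0]).shape)   # (14,)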

A Localized-Distributed ART1 Scheme:


In this scheme, the basic ART1 model of Carpenter & Grossberg (1,5) that deals with
binary feature vectors is considered as the reference system for developing a workable
localized-distributed model for feature recognition/classification. This is outlined as fol-
lows.

(a) Given feature vectors x on the original v-dimensional basis, we obtain,
for each pattern, the extended, but equivalent, feature vector a as indi-
cated earlier. The entire a-pattern is input to a group of processing unit
modules each of which comprises a number of neuron-type processors.
Thus, logically, the pattern vector a is captured by, say, N PUMs (process-
ing unit modules); each PUM, comprising, say, n neuron-type simple proces-
sor elements, receives, at time t, a specific partition of the a pattern for
processing. Thus, the Ith PUM would receive, at time t, the following
ordered input string for processing

a_I(t) = {a_I1, a_I2, ..., a_In}

This is as shown below.

Input Preparation Stage


(b) For each Ith PUM, we have the local group vigilance parameter ρ_I,
which is a variable. For each pattern class or exemplar j, similarly, we
have a separation measure π_j reflecting the minimum tolerable distance
the other classes must be at, given the pattern class j. Initially,

ρ_I = ρ_0 - [1 - e^(-f(m))] u(ρ_c - ρ_0, 0)

and

π_j > 0

where u(x,y) is a uniformly distributed random variable between x and y,
both inclusive, and f(m) is some appropriate monotonically increasing
function of the PUM size m, the number of elemental neuron-type proces-
sors it contains. Also,

ρ = min_I {ρ_I}, and ρ_c = max_I {ρ_I}

(c) Each PUM, working independently of the others, attempts to iden-
tify the best exemplar its input pattern corresponds to. We assume that
there are μ + 1 distinct classes to choose from. One of these, the refuse
class, would correspond to those patterns which are too fuzzy to belong to
any of the other μ classes. The exemplar weights are stored in the b-
matrix. The dot products of its input pattern with the exemplar patterns
are computed at this level as

A_Ij = Σ_i b_Iji a_Ii

where the index i is over all neuron-type processors comprising the Ith
PUM, and the index j points to a specific class or pattern exemplar.

(d) The tentative cluster I(j*) for the Ith PUM conglomerate is obtained
as

I(j*) = index (max_j {A_Ij})

If A_Ij* is less than a threshold δ_I, then the pattern goes to the refuse class.

(e) For each PUM, we obtain a_I as given by

a_I = ( Σ_i t_I(j*)i a_Ii ) / ( Σ_i a_Ii )

(f) If a_I ≤ ρ_I, then we deactivate the suggested class I(j*), as shown by
Carpenter & Grossberg, and go back to (c). Otherwise, at the earliest
possible time, we post I(j*) at the next level up with its (a_I - ρ_I) value.

(g) We next compute the frequency of occurrence m(j*) of the exemplar j*
as the indicated pattern choice at the subpattern levels, and obtain

θ(j*) = m(I(j*)) Σ_I [ (a_I - ρ_I) / ρ_I ] Thr(a_I - ρ_I)
where Thr(x) = 1 if x > 0.0, and 0 otherwise.

(h) The most suggested cluster center is j** where

j** = index (max_j* {θ(j*)})

(i) Synaptic weights (both the top-down and the bottom-up weights) are
updated if θ(j**) > γ, some threshold parameter. Then, referring to time
instances by the label l, we have,

t_Ij**i(l+1) = t_Ij**i(l) a_Ii

b_Ij**i(l+1) = t_Ij**i(l) a_Ii / ( 0.5 + Σ_i t_Ij**i(l) a_Ii )

(j) If the weights are updated via (i), then on all PUM nodes I with I(j*)
as its choice, we update

ρ_I(l+1) = avg_I {ρ_I(l)}

else, we update them as

ρ_I(l+1) = max_I {ρ_I(l)}

The algorithm as outlined here is an extension of the basic approach articulated as
the ART1 algorithm of Carpenter and Grossberg (3). The idea here is to discover the global
cluster structure via cluster formation at the local level. If any such algorithm could be out-
lined, then it is of immense help to us, since any pattern structure could be, in principle,
partitioned into a number of parallel processor unit modules, each of which would then be
required to identify locally, at its level, whatever distinguishing pattern it can see. A minimal
sketch of the per-PUM step is given below.
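The sketch referred to above is our own rendering of steps (c)-(f) for a single PUM operating on binary feature vectors; the variable names (b, t, rho_I, delta_I) and the refuse-class handling are illustrative assumptions, not code from the report.

import numpy as np

def pum_step(a_I, b, t, rho_I, delta_I):
    """One classification attempt by the Ith PUM (steps (c)-(f)).
    a_I    : binary sub-pattern seen by this PUM (length n)
    b, t   : bottom-up and top-down weight matrices, shape (classes, n)
    rho_I  : local group vigilance; delta_I : refuse-class threshold.
    Returns (chosen class index or None for the refuse class, match value)."""
    a_I = np.asarray(a_I, dtype=float)
    A = b @ a_I                                        # dot products A_Ij (step (c))
    active = np.ones(len(A), dtype=bool)
    while active.any():
        j = int(np.argmax(np.where(active, A, -np.inf)))   # step (d)
        if A[j] < delta_I:
            return None, 0.0                           # too weak: refuse class
        match = (t[j] @ a_I) / max(a_I.sum(), 1e-9)    # step (e)
        if match <= rho_I:                             # vigilance test fails (step (f))
            active[j] = False                          # deactivate the class and retry
        else:
            return j, match                            # post I(j*) with (match - rho_I)
    return None, 0.0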

A Bland-net Classifier:

This approach, though it lacks depth, is useful as an alternative cluster determination
procedure. In spite of the fact that it is presented as a centralized scheme, it could be
easily restructured as a localized-distributed scheme, as outlined in the previous section
with the ART1-type algorithm. Its basic architecture is a single-level, or single-layer, set of mul-
tiple neuron-type processors processing the input string, and a decision-unit, sitting at its
top, working on the output from the neurons below. This is outlined below.

(a) As in the localized-distributed ART1 scheme, extend the input feature
vector x to a feature vector a. The input vector is now the ordered string

a(t) = {a_t1, a_t2, ..., a_tL}

(b) Assign random weights to the synaptic elements b_ji. Note that the ele-
ments b_ji ∈ [0,1].

A Bland-net System

(c) On an input pattern to be classified, compute pattern-to-pattern-class
and pattern-class-to-pattern-class distances over the Mahalanobis metric with
a positive value for the λ parameter

D_j = | Σ_i (b_ji - a_i)^λ |^(1/λ)

Q_j = min_k Σ_i (b_ji - b_ki)^2

(d) Compute j* = index (min_j {D_j}). If D_j* < δ, then obtain

j** = index (min_j {D_j / Q_j}); pattern a is in class j**.

(e) Update the synaptic weights b_j**i if the last pattern is included in this class
as

b_j**i(n_j** + 1) = [ n_j** b_j**i(n_j**) + a_i ] / ( n_j** + 1 )

where n_j** is the number of patterns already accommodated in the cluster.

Classifier Architectures

Training a system to the eventual task of correct pattern-class identification on a pat-
tern space may be order-sensitive in some cases. This, of course, depends upon the algo-
rithm used to identify clusters, and the way the training patterns are introduced at the
input and processed by the system subsequently. If the system at the task level is essen-
tially sequential, in the sense that the individual patterns are sequentially admitted and
processed one at a time against the already acquired knowledge, tentative though it may be,
of the likely class-distribution, which, in turn, is expected to be incrementally adjusted given
the current new input, we could eventually get an order-sensitive distribution. Ideally, the
input patterns, in their entirety, should be concurrently processed as one single whole without
any prior reference to any distribution, so that after a given number of cycles we
would obtain a stable, crisp, non-brittle distribution of the sampling population significantly
close to the population distribution in the universe of discourse. But, if the patterns arriv-
ing at an input port are only sequentially admitted as units of a single time-dependent or
time-varying input stream, we cannot arbitrarily keep them on hold in some temporary
buffers to be released at some convenient time for concurrent processing in bulk, unless the
input rate is very high compared to the processing rate.
To understand the overall dimension of the problem in this regard, we consider some
simple processing models. Suppose the patterns arrive at a rate of λ_p patterns/sec at an
input port obeying a Poisson distribution as follows. The probability that the number of
arriving input patterns during an arbitrary interval of length t units is n is given by

p(number of inputs = n within t) = e^(-λ_p t) (λ_p t)^n / n!

Suppose, in the first model, we have all the incoming patterns temporarily queued in a
buffer (infinitely long) till they are serviced one at a time in a FIFO discipline by an
appropriate classifier (using, say, the ART1 algorithm of Carpenter & Grossberg (3,5)) in
which the average individual pattern processing time is 1/μ sec. We assume that the clas-
sification process is exponentially distributed, that is, the probability distribution of the
classification/recognition time is given by

f(s) = μ e^(-μ s),  s ≥ 0
     = 0, otherwise

In this case, the relevant system and pattern traffic aggregate measures emerge as follows
(2), with ρ = λ_p / μ:

L°, the average buffer occupancy = ρ / (1 - ρ)

W°, the average pattern-residence time in the system = 1 / (μ - λ_p)

W_q°, the average pattern time in the buffer = ρ / (μ - λ_p) = W° - 1/μ
We should strive for a system in which L°, W°, and W_q° are all small.
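As a quick numerical check of these M/M/1 measures, the following sketch is our own illustration; the arrival and service rates are made-up numbers.

def mm1_measures(lam, mu):
    """Steady-state M/M/1 measures for arrival rate lam and service rate mu (lam < mu)."""
    rho = lam / mu
    L = rho / (1.0 - rho)        # average number of patterns in the system
    W = 1.0 / (mu - lam)         # average residence time per pattern
    Wq = rho / (mu - lam)        # average waiting time in the buffer
    return L, W, Wq

# Example: patterns arrive at 8/sec and the single classifier completes 10/sec on average.
print(mm1_measures(8.0, 10.0))   # (4.0, 0.5, 0.4)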
We could improve the system somewhat if we process the patterns concurrently by a
set of multiple classifiers. We assume that they are all equal in performance, i.e. capable of
classifying a single pattern in 1/μ time on the average. There are at least two distinct
ways to process the patterns now.

Processor Pool Model A:

We assume that the system, in this case, comprises k identical processors in a proces-
sor pool to process incoming patterns in FIFO mode service. In this case, all the incoming
patterns wait in a single queue served by k identical processors in the M/M/k service
mode. Whenever one of the k machines becomes idle, it picks up the first pattern it finds
at the input queue and attempts to classify it. The system organization is indicated below.

A Processor Pool based
Classifier Model

Given that the arrival rate of the patterns into the system is still λ_p patterns/sec,
and the service rate at any of the k given classifiers is also Poisson with a mean rate of μ,
and that λ_p < kμ, the equilibrium probability P_0 that the system is idle is given by

P_0 = [ Σ_{i=0}^{k-1} (λ_p/μ)^i / i!  +  (λ_p/μ)^k / ( k! (1 - λ_p/(kμ)) ) ]^(-1)

and the equilibrium probability P_i that the system has i patterns in it is given by

P_i = (λ_p/μ)^i / i! · P_0 ,             for i < k

    = (λ_p/μ)^i / (k! k^(i-k)) · P_0 ,   for i ≥ k

in terms of which the system aggregates could be computed as

L_q = P_0 (λ_p/μ)^k (λ_p/(kμ)) / ( k! [1 - λ_p/(kμ)]^2 )

where L_q is the average number of patterns in the processing queue waiting for one of the
k processors to process them, and

W_A = L_q/λ_p + 1/μ , and   L_A = λ_p W_A

where W_A is the average residence time per pattern in the system, and L_A, respectively, is
the system occupancy rate in model A.

k-Parallel Stream model B:

In this case, the input stream is split into k streams, each of which is then looked after
by a single processor with a processing power, as before, at a rate of completing on the
average one pattern every 1/μ seconds. This is schematically as shown below.

Parallel Stream or SIMD based design

At each arriving pattern stream, we assume the interarrival time between any two
consecutive patterns, on the average, would be exponentially distributed with a mean of
k/λ_p, since upon becoming available for processing a pattern would be sent down to one of
the processor queues with a probability of 1/k. In this case, the system aggregates are as follows

L_B = λ_p / (kμ - λ_p)

with the average residence time per pattern as

W_B = k / (kμ - λ_p)

In either case, whether we provide a bank of multiple servers for a single-queue pat-
tern traffic or a multiple-port service with each input port being serviced by a single pro-
cessor, the net result is a reduction of residence times, since both W_A and W_B are less
than W°, for k > 1.
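The sketch below is ours, with made-up rates; it compares the three organizations numerically: a single classifier, the model A processor pool (M/M/k), and the model B parallel streams.

from math import factorial

def mmk_residence(lam, mu, k):
    """Average residence time W_A per pattern in an M/M/k pool (requires lam < k*mu)."""
    a = lam / mu
    p0 = 1.0 / (sum(a**i / factorial(i) for i in range(k))
                + a**k / (factorial(k) * (1.0 - lam / (k * mu))))
    lq = p0 * a**k * (lam / (k * mu)) / (factorial(k) * (1.0 - lam / (k * mu))**2)
    return lq / lam + 1.0 / mu

lam, mu, k = 8.0, 10.0, 2
w_single  = 1.0 / (mu - lam)            # one classifier (M/M/1)
w_pool    = mmk_residence(lam, mu, k)   # model A: single queue, k classifiers
w_streams = k / (k * mu - lam)          # model B: k independent streams
print(w_single, w_pool, w_streams)      # the shared pool gives the smallest residence time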

A Quasi-Array Processor Model:

Here, we assume the system to be in one of two mutually exclusive states. We
assume that it is either in a learning state A, whereupon it trains itself incrementally with a
stream of incoming feature vectors on the assumption that these vectors do constitute a
set of legitimate patterns from the population of patterns in a statistical sense, or it is in
state B, where its learning has stabilized to the extent of knowing precisely into how many
pattern classes the pattern domain is to be partitioned, and also, to a large extent, what the
corresponding pattern exemplars look like.
Assuming the system to be in state A, that is, in the learning state, we let the incom-
ing feature vectors all be collated in a time-ordered sequence at the input buffer B serviced by
a controller. The controller periodically picks up k feature vectors from the buffer and
gets them shuffled at a permutation block P, such that the input string of vectors {a(t),
a(t+1), ..., a(t+k-1)} is changed to a random sequence of vectors of the form {a(t_k), a(t_1),
..., a(t_m)} at the output of P. The controller then dispatches these vectors to the classifiers
at a rate of one vector per classifier per dispatch event. The classifier bank comprises k
independent classifying units. Each classifier Ci, upon receiving the vector a(t_i), attempts to
learn from it using one of the algorithms outlined. After learning/classifying m consecutive
patterns, each classifier Ci sends its exemplar profile map to its output node Oi, where the
output exemplar-profile corresponds to the pattern-class distribution for the last m
independent feature vector events. The output profile at Oi is a string of vectors along
with their pattern counts, i.e. the number of patterns it has accommodated in a given class,
as shown below.

ψ_i(l) = {(Z_1, m_1), (Z_2, m_2), ..., (Z_h, m_h)}

with m_j = the cardinality of the class j given m patterns in total,

indicating that at the end of m consecutive pattern processings the plausible partition of
the feature-space is an h-tuple as shown above. The local output node Oi sends this feature
map to the next level up, at the node Y, where all such k independent feature maps arising
from below are consolidated with the consolidated feature map obtained at the last cycle.
If, on the other hand, the system is in state B, that is, in the state of functioning as a
pattern recognizer, the controller sends the input pattern to be recognized directly to the
processor Y in an M/M/c processing mode, assuming the feature vector arrival process at
the input buffer is Poisson and the pattern recognition time at Y is exponentially distri-
buted, respectively. The service times at the processors Ci are all assumed to be exponen-
tially distributed.
We assume that the switching over from state A to state B takes place once the
system is triggered by one or more of the following conditions:

* The input stream is punctuated by a time gap. After the time-out,
the system migrates into state B.

* The class-exemplar profiles at Y over the last q updates involve
exemplar-pattern perturbations whose maximum is lower than
some threshold parameter.

Note that at any sampling event time during or after training episodes, after some minimal
time point t_0, the node Y has a specific exemplar profile. Even if one or more classifier-
type processors fault, the system is globally fault-tolerant. The basic structure is indicated
below.

Note that each elemental concurrent processor Ci could be, at some level, a feedback
processor, as shown in the diagram, in the event the pattern-class corresponding to an
input pattern requires further processing.

Consolidation of local distributions


into a global pattern-class distribution

At the node Y, the global pattern-classification emerges asymptotically. That is, if
the class-distribution profile at Y at time t = l is given by the distribution φ_Y(t = l), then
the limit of φ_Y(t) as t → ∞, we assume, does exist and constitutes the desired class-exemplar pro-
files. We assume that at the end of some computational cycle l, the individual class-
distribution profiles together with their local frequencies of distribution, {ψ_i(l), 1 ≤ i ≤ k},
are amassed from the output nodes {Oi} below. All these are reconciled, at the end of the
lth cycle, with the previously computed global class-distribution φ_Y(l-1) at Y as follows.

φ_Y(l) = φ_Y(l-1) ∪ ( ∪_{i=1}^{k} ψ_i(l) )
and, given two class-distribution profiles

ψ_i(l) = {(Z_i1, m_i1), (Z_i2, m_i2), ..., (Z_ih, m_ih)}

ψ_j(l) = {(Z_j1, m_j1), (Z_j2, m_j2), ..., (Z_jk, m_jk)}

we define the "union" operation of the above profiles as

ψ_i(l) ∪ ψ_j(l) = { ( (Z_iq + Z_jr)/2 , (m_iq + m_jr) )  |  d(Z_iq, Z_jr) < δ }

where d(.,.) is the "distance" between the exemplar patterns in the argument. One could,
if necessary, condition the union operation further by requiring that two classes may not
be allowed to coalesce if the resultant class were to emerge as too big a class by size.
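The sketch below is our own rendering of this consolidation step; merging matched exemplars by their count-weighted centroid is an assumption, since the text only specifies that exemplars within distance δ coalesce and that their counts add.

import numpy as np

def union_profiles(psi_i, psi_j, delta):
    """Consolidate two class-distribution profiles.
    Each profile is a list of (exemplar_vector, count) pairs; exemplars closer
    than delta coalesce (counts add), the rest are carried over unchanged."""
    merged, used = [], set()
    for z_q, m_q in psi_i:
        match = None
        for r, (z_r, m_r) in enumerate(psi_j):
            if r not in used and np.linalg.norm(z_q - z_r) < delta:
                match = r
                break
        if match is None:
            merged.append((z_q, m_q))
        else:
            z_r, m_r = psi_j[match]
            used.add(match)
            centroid = (m_q * z_q + m_r * z_r) / (m_q + m_r)   # assumed merge rule
            merged.append((centroid, m_q + m_r))
    merged += [pair for r, pair in enumerate(psi_j) if r not in used]
    return merged

# Example: two local profiles with one pair of nearby exemplars.
p1 = [(np.array([0.0, 0.0]), 5), (np.array([4.0, 4.0]), 3)]
p2 = [(np.array([0.2, 0.1]), 2), (np.array([9.0, 9.0]), 1)]
print(len(union_profiles(p1, p2, delta=1.0)))   # 3 classes after consolidation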

Performance Profile

The above model, as proposed, attempts to provide a 'bias-free' architecture via con-
current multistream computation of the global class-distribution. If the patterns, or rather
the feature vectors, during the training period, arrive into the system at a Poisson rate of
λ_p vectors/sec, each processing stream then receives input at a rate of λ_p/k vectors/sec,
which gets processed with an average residence time of W sec. per pattern, where

W = 1/λ_Y + k / (kμ_c - λ_p)

where λ_Y is the average processing rate of feature vectors at the level Y, and μ_c is the
average classification rate of an individual classifier Ci.


Note that, in some cases, particularly in Carpenter/Grossberg-type algorithms, the
internal processing by the elemental processor Ci may include a feedback loop, as shown
below. This is due to the fact that sometimes a tentative class exemplar may not resonate
at a specific class, in which case one has to deactivate the class temporarily and search for
an alternative choice by going back into the algorithm at least one more time.

An Internal Classifier with a Feedback Loop

Assuming that with a probability θ a task returns to the end of the buffer for at least
one more round of processing, and with a probability of (1 - θ) it departs the classifier, one
obtains, at equilibrium,

[ λ_p/k + (1-θ)μ_c ] p_n = (λ_p/k) p_{n-1} + (1-θ)μ_c p_{n+1} ,   n ≥ 1

where p_n is the equilibrium or steady-state probability that the subsystem at the Ci
processor stream has n tasks in it (including one being processed, if any). The equilibrium
population densities come out to be

p_n = [ λ_p / (k(1-θ)μ_c) ]^n ( 1 - λ_p / (k(1-θ)μ_c) ) ,   n ≥ 0

and the steady-state system occupancy rate L_i comes out as

L_i = Σ_{n=0}^{∞} n p_n = λ_p / ( k(1-θ)μ_c - λ_p )

while the expected response time of a task in this stream, W_i, comes out to be

W_i = L_i k / λ_p
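For a quick check of these feedback measures, here is a small sketch of ours; the rates and the feedback probability are made-up numbers.

def feedback_stream_measures(lam_p, mu_c, k, theta):
    """Per-stream measures for an M/M/1 queue with Bernoulli feedback:
    arrival rate lam_p/k, effective service rate (1 - theta)*mu_c."""
    assert lam_p < k * (1.0 - theta) * mu_c, "stream must be stable"
    L_i = lam_p / (k * (1.0 - theta) * mu_c - lam_p)   # mean number of tasks in the stream
    W_i = L_i * k / lam_p                              # Little's law applied to one stream
    return L_i, W_i

print(feedback_stream_measures(lam_p=8.0, mu_c=10.0, k=2, theta=0.25))   # (1.14..., 0.28...)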

A Multidepth Pipeline
Classifier Processor

Furthermore, one could extend this architecture to a multidepth pipeline organization,
indicated below, in which at a certain depth ξ either a feature vector is processed to the
extent of knowing in which class it best belongs, or it is not, in which case the pattern vec-
tor is passed on to the next-level classifier working at a finer-grain resolution at a depth (ξ
- 1). Assuming that with a probability of ζ at a level a feature-vector is referred to the
next-level processor, and with a probability of θ it loops back for further processing at the
level ξ, and assuming that, for all practical purposes, the system is decomposable into multi-
ple M/M/c subsystems working in tandem, we have the effective pattern arrival rate in
each subsystem at the depth ξ as, at equilibrium,

λ_ξ = λ_p / ( k(1 - θζ) )

in terms of which the expected residence time of a pattern task at each pipeline section
comes out to be

W_ξ = 1 / (μ_c - λ_ξ)

Note that an infinite variety of parallel distributed architectures for neural-based net-
work systems could be proposed. In almost all cases, multiple neuron-type processors are
best suited to provide SIMD-type architecture at some computational level. Organization
of machines based on SIMD-type primitives as indicated above yields substantial perfor-
mance.

References:

1. Lippmann, R. P., 'An Introduction to Computing with Neural Nets', IEEE ASSP
Magazine, April 1987.

2. Ross, Sheldon, An Introduction to Probability Models, Prentice-Hall Publishers, 1980.

3. Carpenter, G. A. and S. Grossberg, 'ART2: Self-organization of stable category
recognition codes for analog input patterns', Applied Optics, 26, pp 4919-4930, 1987.

4. Duda, R. O. and P. E. Hart, Pattern Classification and Scene Analysis, Wiley,
New York, 1973.

5. Carpenter, G. A. and S. Grossberg, 'Category learning and adaptive pattern
recognition, a neural network model', Proc. Third Army Conference on Applied
Mathematics and Computing, ARO Report 86-1, pp 37-56, 1985.

DISCRETE FOURIER TRANSFORLMS ON HYPERCUBES

This paper presents a decomposition of the discrete Fourier transform matrix that is more
explicit than what is found in the literature. It is shown that such a decomposition forms
an important tool in mapping the transform computation onto a parallel architecture such
as that provided by the hypercube topology. A basic scheme for power-of-a-prime length
transform computations on the hypercube is discussed.

The use of the Chinese Remainder Theorem in pipelining the computation of the
transforms, for lengths that are not a power of a prime, is also discussed.

1. A DFT matrix decomposition

Recall that the discrete Fourier transform Y of length n of a vector X of length n is
defined by

Y = F X

where F is the square matrix of order n given by

F = [ W^(kj) ]

with 0 ≤ k, j ≤ n - 1 and where W is the nth root of unity

W = e^(-2πi/n) = cos( 2π/n ) - i sin( 2π/n ),

with i² = -1.

It is claimed that if n = 2^γ, then the DFT matrix F is, up to a permutation matrix, the
product of γ = log n sparse matrices

F = F_0 F_1 F_2 ... F_{γ-2} F_{γ-1}

where each F_{γ-i} is a block diagonal matrix

F_{γ-i} = diag( F_{γ-i,0}, F_{γ-i,1}, ..., F_{γ-i,2^{i-1}-1} )

and the following properties hold

(i) each block F_{γ-i,α} is square of order 2^{(γ-i)+1} = n/2^{i-1} and is of the form

F_{γ-i,α} = [ I    W^r I ]
            [ I   -W^r I ]

and there are exactly 2^{i-1} blocks; I is the identity matrix of order n/2^i;

(ii) for each block, r is the nonnegative integer whose γ-bit binary expansion is the reversal
of that of 2α.

The proof of this decomposition can be sketched as follows.

We write the components of Y and X as y[k] and x[j], for 0 ≤ k, j ≤ n - 1, so that

y[k] = Σ_{j=0}^{n-1} x[j] W^(kj)

We then write the indices k and j in binary, using γ bits of course, i.e.

k = k_{γ-1} k_{γ-2} k_{γ-3} ... k_2 k_1 k_0

j = j_{γ-1} j_{γ-2} j_{γ-3} ... j_2 j_1 j_0

In decimal,

k = 2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0

j = 2^{γ-1} j_{γ-1} + 2^{γ-2} j_{γ-2} + ... + 2 j_1 + j_0

Hence, with indices written in binary, we get

y[k] = y[k_{γ-1} k_{γ-2} ... k_1 k_0]

x[j] = x[j_{γ-1} j_{γ-2} ... j_1 j_0]

W^(kj) = W^( (2^{γ-1} k_{γ-1} + ... + 2 k_1 + k_0)(2^{γ-1} j_{γ-1} + ... + 2 j_1 + j_0) )

y[k_{γ-1} ... k_1 k_0] =
   Σ_{j_0=0}^{1} ( Σ_{j_1=0}^{1} ( ... ( Σ_{j_{γ-1}=0}^{1} x[j_{γ-1} ... j_1 j_0] W_{γ-1} ) W_{γ-2} ) ... W_1 ) W_0
where

W_{γ-1} = W^( (2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) 2^{γ-1} j_{γ-1} )

W_{γ-2} = W^( (2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) 2^{γ-2} j_{γ-2} )

...

W_0 = W^( (2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) j_0 )

Since W^(2^γ) = W^n = 1, we actually get

W_{γ-1} = W^( 2^{γ-1} k_0 j_{γ-1} )

W_{γ-2} = W^( 2^{γ-2} (2 k_1 + k_0) j_{γ-2} )

...

W_1 = W^( (2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) 2 j_1 )

W_0 = W^( (2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) j_0 )

The computation can then be carried out in stages, where at each stage we calculate an
intermediate vector X_{γ-i}.

Note that in our notation, the first vector to be computed is X_{γ-1}, and the last is X_0.
From the expansion of y[k] above, we see that the components of the vectors X_{γ-i} and
X_{γ-i+1} are related by the equations

X_{γ-i}[k_0 ... k_{i-3} k_{i-2} k_{i-1} j_{γ-i-1} j_{γ-i-2} ... j_1 j_0]

   = X_{γ-i+1}[k_0 ... k_{i-3} k_{i-2} 0 j_{γ-i-1} j_{γ-i-2} ... j_1 j_0] +

     X_{γ-i+1}[k_0 ... k_{i-3} k_{i-2} 1 j_{γ-i-1} j_{γ-i-2} ... j_1 j_0] ·
     W^( 2^{γ-i} (2^{i-1} k_{i-1} + 2^{i-2} k_{i-2} + ... + k_0) )

i.e. we may write

X_{γ-i} = F_{γ-i} X_{γ-i+1}

for some appropriate matrix F_{γ-i}.

From the last expression above, we see that the matrix F_{γ-i} can be blocked into 2^{i-1}
submatrices. Each of the submatrices is determined by a unique value of the bits that are
more significant than the (i-1)th bit, namely

k_0 k_1 ... k_{i-3} k_{i-2}
We use these bits as our label α for each block, so the decimal value of α is

α = 2^{i-2} k_0 + 2^{i-3} k_1 + ... + 2 k_{i-3} + k_{i-2}

The size of each block F_{γ-i,α} is determined by a unique value of the bits that are less
significant than those of the label α. We can see that there are γ-i+1 such bits, hence the
size of F_{γ-i,α} is

2^{γ-i+1} = 2^γ / 2^{i-1} = n / 2^{i-1}

We next need to show that F_{γ-i,α} is given by

F_{γ-i,α} = [ I    W^r I ]
            [ I   -W^r I ]

For a fixed α = 2^{i-2} k_0 + 2^{i-3} k_1 + ... + 2 k_{i-3} + k_{i-2}, the two possible values for k_{i-1} determine
the two halves of the matrix F_{γ-i,α}.

Indeed when k_{i-1} = 0 then

X_{γ-i}[k_0 ... k_{i-3} k_{i-2} 0 j_{γ-i-1} j_{γ-i-2} ... j_1 j_0] =

X_{γ-i+1}[k_0 ... k_{i-3} k_{i-2} 0 j_{γ-i-1} j_{γ-i-2} ... j_1 j_0] +

X_{γ-i+1}[k_0 ... k_{i-3} k_{i-2} 1 j_{γ-i-1} j_{γ-i-2} ... j_1 j_0] ·
W^( 2^{γ-i} (0 + 2^{i-2} k_{i-2} + 2^{i-3} k_{i-3} + ... + k_0) )

Now the submatrices I and W^r I, where I is the identity matrix of order n/2^i, are evident,
where r is given by

r = 2^{γ-i} ( 2^{i-2} k_{i-2} + 2^{i-3} k_{i-3} + ... + 2 k_1 + k_0 )

A similar analysis for k_{i-1} = 1 shows the two submatrices in the second half of F_{γ-i,α}.

It remains to show that the γ-bit binary expansion of r is that of the reverse of 2α.

Note that

r = 2^{γ-2} k_{i-2} + 2^{γ-3} k_{i-3} + ... + 2^{γ-i+1} k_1 + 2^{γ-i} k_0

α = 2^{i-2} k_0 + 2^{i-3} k_1 + ... + 2 k_{i-3} + k_{i-2}

2α = 2^{i-1} k_0 + 2^{i-2} k_1 + ... + 2^2 k_{i-3} + 2 k_{i-2}

The γ-bit binary expansion of 2α is then

2^{γ-1} · 0 + 2^{γ-2} · 0 + ... + 2^i · 0 + 2^{i-1} k_0 + ... + 2^2 k_{i-3} + 2^1 k_{i-2} + 2^0 · 0

Reversing the bits of 2α amounts to making the following substitutions

2^1 k_{i-2} → 2^{γ-2} k_{i-2}

2^2 k_{i-3} → 2^{γ-3} k_{i-3}

...

2^{i-1} k_0 → 2^{(γ-1)-(i-1)} k_0 = 2^{γ-i} k_0

which yields the number

2^{γ-2} k_{i-2} + 2^{γ-3} k_{i-3} + ... + 2^{γ-i} k_0

which is indeed r.

Note that because of the way the indices are ordered, at the end we get

y[k_{γ-1} k_{γ-2} ... k_1 k_0] = X_0[k_0 k_1 ... k_{γ-2} k_{γ-1}]

Hence

F = R_{2^γ} F_0 F_1 F_2 ... F_{γ-2} F_{γ-1}

where R_{2^γ} is the permutation matrix corresponding to the permutation of the numbers

0, 1, ..., n-1 = 2^γ - 1

defined by reversing the bits in the binary representation.
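The following sketch is our own implementation check, not code from the paper; it carries out the γ butterfly stages and the final bit-reversal permutation exactly as derived above and compares the result with a direct FFT.

import numpy as np

def bit_reverse(m, bits):
    """Reverse the 'bits'-bit binary representation of m."""
    return int(format(m, f'0{bits}b')[::-1], 2) if bits > 0 else 0

def dft_by_stages(x):
    """Length n = 2**gamma DFT computed as the gamma sparse-matrix stages
    F_{gamma-1}, ..., F_0 of the decomposition, followed by the bit-reversal
    permutation R (a sketch written directly from the relations above)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    gamma = n.bit_length() - 1
    assert 1 << gamma == n, "length must be a power of two"
    W = np.exp(-2j * np.pi / n)
    cur = x.copy()
    for i in range(1, gamma + 1):                 # stage i applies F_{gamma-i}
        nxt = np.empty_like(cur)
        half = 1 << (gamma - i)                   # paired entries differ by 2^(gamma-i)
        for base in range(0, n, 2 * half):
            alpha = base // (2 * half)            # block label: bits k_0 ... k_{i-2}
            r = half * bit_reverse(alpha, i - 1)  # exponent r = 2^(gamma-i)*(k_{i-2}...k_1 k_0)
            for off in range(half):
                a, b = cur[base + off], cur[base + half + off]
                nxt[base + off] = a + W**r * b
                nxt[base + half + off] = a - W**r * b
        cur = nxt
    rev = [bit_reverse(m, gamma) for m in range(n)]
    return cur[rev]                               # y[k] = X_0[reverse(k)]

x = np.random.default_rng(0).standard_normal(8)
print(np.allclose(dft_by_stages(x), np.fft.fft(x)))   # True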

2. The hypercube topology

The decomposition of the DFT matrix into a product of sparse matrices, as shown in
section 1, provides the essential tool for mapping the computation of the DFT onto
multiprocessor systems. In this section we examine the case of a hypercube.

The hypercube topology has now appeared in successful commercial multiprocessor
systems, including Intel's iPSC, Ametek's S14, Ncube's NCUBE systems and the
Connection Machines from Thinking Machines. The hypercube has a simple recursive

definition:

(a) a 0-cube, with 2^0 = 1 node, is a uniprocessor;

(b) for d ≥ 1, a d-cube, with 2^d nodes, is obtained from two copies of a

(d-1)-cube by connecting each pair of corresponding nodes with a

communication link.

Note the very natural labeling of the nodes of the d-cube: simply precede the labels of the

nodes from the first copy of the (d-1)-cube with 0, and those from the second copy with a

1 and connect nodes that differ only in the first bit.

(Node diagrams of the 0-cube, 1-cube, 2-cube, 3-cube and 4-cube, with nodes labeled by their binary addresses.)

The hypercube is attractive for several reasons, including the following

(a) many of the classical topologies (rings, trees, meshes, etc.) can be embedded in
hypercubes (hence previously designed algorithms may still be used);

(b) the layout is totally symmetric (by the way, the hypercube appears among proposed
architectures designed from finite algebraic groups);

(c) the communication diameter is logarithmic in the number of processors;

(d) the topology exhibits a reasonable behavior in the presence of faulty processors or
communication links; there are several paths connecting one node to another; in fact the
number of disjoint paths of minimum distance from node A to node B is the Hamming
distance between the (binary) labels of A and B;

(e) a hypercube is Hamiltonian, i.e. a ring with 2^d nodes can be embedded in a d-cube
(see the sketch after this list);

(f) many algorithms for diverse fields such as the physical sciences, signal processing,
numerical analysis, operations research and graphics have been successfully designed for
the hypercube.
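As a small illustration of these label-based properties (our own sketch, not part of the paper): neighbors differ in exactly one label bit, the distance between two nodes is the Hamming distance of their labels, and a binary reflected Gray code traces the Hamiltonian ring of 2^d nodes.

def neighbors(p, d):
    """Nodes adjacent to node p in a d-cube: flip one bit of the label."""
    return [p ^ (1 << b) for b in range(d)]

def hamming(a, b):
    """Communication distance between nodes a and b (number of differing bits)."""
    return bin(a ^ b).count("1")

def gray_ring(d):
    """Binary reflected Gray code: consecutive labels differ in one bit,
    so it traces a ring with 2^d nodes embedded in the d-cube."""
    return [m ^ (m >> 1) for m in range(1 << d)]

print(neighbors(0b101, 3))      # [4, 7, 1], i.e. 100, 111, 001
print(hamming(0b101, 0b010))    # 3: the two labels differ in every bit
print(gray_ring(3))             # [0, 1, 3, 2, 6, 7, 5, 4]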

3. Example of a transform computation mapping onto a hypercube

A decomposition of the DFT matrix into a product of sparse matrices is visualized, in the
usual manner, by means of a signal flow graph. For the length n = 2^3, the decomposition
of section 1 yields the following graph,

(Signal flow graph for the length-8 transform, with columns of nodes labeled X, X_2, X_1, X_0
and row indices 000 through 111.)

where the unscrambling of the index bits is not shown. Each column of nodes in the graph
corresponds to an intermediate vector X_{γ-i} of section 1.

Suppose now that a 2-cube is available. Looking at the signal flow graph, we can allocate
the first 2 components (indexed 000 and 001) to processor 00, the next two components
(indexed 010 and 011) to processor 01, etc. In general the two components indexed ab0 and
ab1 are allocated to processor ab, as shown in the following figure.

(The length-8 flow graph with its components allocated two per processor of the 2-cube.)

We see that the computation requires 3 steps (if we ignore the bit reversing); during the
first 2 steps, interprocessor communication is required; during the last step, only local data
is used in each processor, so there is no message passing between the processors.

4. The general case

The above example generalizes to the case of a length n = 2^γ transform, for which a k-
cube is available. Each processor is assigned a k-bit binary label. The result of section 1
allows us to allocate components of X and any intermediate vector X_{γ-i} as follows: each
processor will have 2^γ / 2^k = 2^{γ-k} components at any stage; the processor labeled p is
allocated the components with indices

p·2^{γ-k}, p·2^{γ-k} + 1, ..., p·2^{γ-k} + 2^{γ-k} - 1

i.e. if the processor label is

p = p_{k-1} p_{k-2} ... p_1 p_0

then the processor works on the components whose indices are

p_{k-1} p_{k-2} ... p_1 p_0 000...00
p_{k-1} p_{k-2} ... p_1 p_0 000...01
p_{k-1} p_{k-2} ... p_1 p_0 000...10
p_{k-1} p_{k-2} ... p_1 p_0 000...11
...
p_{k-1} p_{k-2} ... p_1 p_0 111...10
p_{k-1} p_{k-2} ... p_1 p_0 111...11

From the discussion in section 1 it should be clear that the computation can be
decomposed into γ stages. The first k stages require interprocessor communication and the
last γ - k stages use local data only. During stage i, where 1 ≤ i ≤ k, each processor labeled

p = p_{k-1} p_{k-2} ... p_1 p_0

communicates with and only with the processor whose label differs from p only in the ith
bit from the left. A small sketch of this allocation and communication pattern is given below.
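The helper below is our own sketch: it spells out the allocation and the per-stage communication partner; it only reports which processor holds which components and who exchanges data with whom, not the butterfly arithmetic.

def allocation(gamma, k):
    """Map each processor label p of a k-cube to the 2^(gamma-k) component
    indices p*2^(gamma-k), ..., p*2^(gamma-k) + 2^(gamma-k) - 1 that it holds."""
    block = 1 << (gamma - k)
    return {p: list(range(p * block, (p + 1) * block)) for p in range(1 << k)}

def partner(p, stage, k):
    """During stage i (1 <= i <= k) processor p exchanges data with the
    processor whose label differs from p in the ith bit from the left."""
    return p ^ (1 << (k - stage))

# Length-8 transform (gamma = 3) on a 2-cube (k = 2):
print(allocation(3, 2))          # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(partner(0b01, 1, 2))       # stage 1: processor 01 talks to 11
print(partner(0b01, 2, 2))       # stage 2: processor 01 talks to 00
# Stages 3..gamma (here only stage 3) use local data, so no message passing.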

5. Decomposing with the Chinese Remainder Theorem

The scheme presented in the previous sections works for a length that is a power of 2, and
should generalize easily to any power of a prime integer. The next most obvious question is
how to decompose the computation if the length is not the power of a prime. In this case,
a possible answer is provided by the prime factorization of n

n = p_1^{e_1} p_2^{e_2} ... p_{k-1}^{e_{k-1}} p_k^{e_k}

and the ring isomorphism provided by the so-called Chinese Remainder Theorem (CRT).

The CRT states that if the integers n_1, n_2, ..., n_k are pairwise coprime, the ring Z_n and
the product ring

Z_{n_1} × Z_{n_2} × ... × Z_{n_k}

are isomorphic, where n is the product of the n_i, and the arithmetic in the ring Z_n is ordinary
modular arithmetic (in the product everything is computed componentwise).

The forward isomorphism

Z_n → Z_{n_1} × Z_{n_2} × ... × Z_{n_k}

is given by

x → ( x mod n_1, x mod n_2, ..., x mod n_k )

The inverse mapping is given by

( a_1, a_2, ..., a_k ) → x

where

x = [ a_1 u_1 (u_1^{-1} mod n_1) + a_2 u_2 (u_2^{-1} mod n_2) + ... + a_k u_k (u_k^{-1} mod n_k) ] mod n

where u_i = n/n_i and u_i^{-1} is the inverse of u_i modulo n_i (which must exist since u_i and n_i
are clearly coprime).

The essence of the isomorphism is that if n = n_1 n_2 ... n_k, where the n_i are pairwise coprime,
then an integer between 0 and n-1 may be thought of as a k-tuple of integers, the first
between 0 and n_1 - 1, the second between 0 and n_2 - 1, ..., the last between 0 and n_k - 1.
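A small sketch of the two mappings (our own, not from the paper):

from math import prod

def to_residues(x, moduli):
    """Forward isomorphism Z_n -> Z_n1 x Z_n2 x ... x Z_nk."""
    return tuple(x % ni for ni in moduli)

def from_residues(residues, moduli):
    """Inverse mapping via the CRT reconstruction formula above."""
    n = prod(moduli)
    x = 0
    for a_i, n_i in zip(residues, moduli):
        u_i = n // n_i
        x += a_i * u_i * pow(u_i, -1, n_i)   # u_i^(-1) mod n_i exists since gcd(u_i, n_i) = 1
    return x % n

moduli = (4, 9, 5)                          # pairwise coprime, n = 180
print(to_residues(97, moduli))              # (1, 7, 2)
print(from_residues((1, 7, 2), moduli))     # 97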

Now the relationship

y[a] = Σ_{b=0}^{n-1} x[b] W^(ab)

may be written

y(a_1, a_2, ..., a_k) = Σ_{b_1=0}^{n_1-1} Σ_{b_2=0}^{n_2-1} ... Σ_{b_k=0}^{n_k-1} x(b_1, b_2, ..., b_k) W^( (a_1,a_2,...,a_k)(b_1,b_2,...,b_k) )

The product

(a_1, a_2, ..., a_k)(b_1, b_2, ..., b_k)

is computed in the product ring

Z_{n_1} × Z_{n_2} × ... × Z_{n_k}

as follows

(a_1, a_2, ..., a_k)(b_1, b_2, ..., b_k) = (a_1 b_1, a_2 b_2, ..., a_k b_k)

= a_1 b_1 (1,0,0,...,0) + a_2 b_2 (0,1,0,...,0) + ... + a_k b_k (0,0,0,...,1)

hence if we let

W_i = W^( (0, ..., 0, 1, 0, ..., 0) )

where the 1 is at the ith position, then

y(a_1, a_2, ..., a_k) = Σ_{b_1=0}^{n_1-1} Σ_{b_2=0}^{n_2-1} ... Σ_{b_k=0}^{n_k-1} x(b_1, b_2, ..., b_k) W_1^{a_1 b_1} W_2^{a_2 b_2} ... W_k^{a_k b_k}

= Σ_{b_1=0}^{n_1-1} W_1^{a_1 b_1} ( Σ_{b_2=0}^{n_2-1} W_2^{a_2 b_2} ( ... ( Σ_{b_k=0}^{n_k-1} x(b_1, b_2, ..., b_k) W_k^{a_k b_k} ) ... ) )

Note that W_i is an n_i-th root of unity.

The computation of vector Y may thus be accomplished in k stages.

STAGE 1

U_1(b_1, ..., b_{k-1}, a_k) = Σ_{b_k=0}^{n_k-1} x(b_1, b_2, ..., b_k) W_k^{a_k b_k}

STAGE 2

U_2(b_1, ..., b_{k-2}, a_{k-1}, a_k) = Σ_{b_{k-1}=0}^{n_{k-1}-1} U_1(b_1, ..., b_{k-1}, a_k) W_{k-1}^{a_{k-1} b_{k-1}}

STAGE i

U_i(b_1, ..., b_{k-i}, a_{k-i+1}, ..., a_k) = Σ_{b_{k-i+1}=0}^{n_{k-i+1}-1} U_{i-1}(b_1, ..., b_{k-i+1}, a_{k-i+2}, ..., a_k) W_{k-i+1}^{a_{k-i+1} b_{k-i+1}}

STAGE k

U_k(a_1, a_2, ..., a_k) = y(a_1, a_2, ..., a_k) = Σ_{b_1=0}^{n_1-1} U_{k-1}(b_1, a_2, ..., a_k) W_1^{a_1 b_1}

Stage i requires the computation of n/n_{k-i+1} DFTs, each of length n_{k-i+1}. These DFTs may
be computed in parallel, once stage i-1 has been completely finished. Note that if M_i (resp.
A_i) is the number of multiplications (resp. additions) required for the n_i-transform, then in
total the n-transform requires n · Σ_{i=1}^{k} M_i/n_i multiplications and n · Σ_{i=1}^{k} A_i/n_i additions.
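The sketch below is ours; it carries out the k stages for a small composite length and checks the result against a direct transform. The index bookkeeping uses the CRT correspondence a <-> (a mod n_1, ..., a mod n_k) described in the previous section.

import numpy as np
from math import prod

def crt_dft(x, moduli):
    """DFT of length n = prod(moduli), with the moduli pairwise coprime,
    computed in k stages of short DFTs (one modulus per stage) using the
    CRT index correspondence a <-> (a mod n_1, ..., a mod n_k)."""
    n = len(x)
    assert n == prod(moduli)
    W = np.exp(-2j * np.pi / n)
    # e_i is the integer corresponding to (0, ..., 1, ..., 0); W_i = W**e_i
    e = [(n // ni) * pow(n // ni, -1, ni) % n for ni in moduli]
    # arrange x as a k-dimensional array T[b_1, ..., b_k]
    T = np.zeros(moduli, dtype=complex)
    for a in range(n):
        T[tuple(a % ni for ni in moduli)] = x[a]
    # transform one axis per stage, starting with b_k and ending with b_1
    for axis in range(len(moduli) - 1, -1, -1):
        ni, Wi = moduli[axis], W ** e[axis]
        F = np.array([[Wi ** (r * c) for c in range(ni)] for r in range(ni)])
        T = np.moveaxis(np.tensordot(T, F, axes=([axis], [1])), -1, axis)
    # read the result back out through the same index correspondence
    return np.array([T[tuple(a % ni for ni in moduli)] for a in range(n)])

x = np.random.default_rng(0).standard_normal(12)
print(np.allclose(crt_dft(x, (3, 4)), np.fft.fft(x)))   # True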

6. Future work

We need to find decompositions of the DFT matrix that allow better mappings of the
computation onto the hypercube. The scheme discussed in sections 1 through 4, for
example, leaves the load between the processors unbalanced (some of the processors
perform multiplications while others do not).

The decomposition of section 5 suggests a combination of pipelining and multiprocessing

that needs to be formalized. It also calls for the (parallel) study of transforms of short

lengths.

7. References

[1] Fox, G. et al., Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall, 1988.

[2] Kallstrom, M. and S. S. Thakkar, 'Programming three parallel computers', IEEE
Software, Jan. 1988, pp. 11-22.

[3] Oppenheim, A. V. and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, 1989.

[4] Saad, Y., 'Gaussian elimination on hypercubes', in Parallel Algorithms and
Architectures (M. Cosnard et al., eds.), Elsevier Sci. Publ., 1986, pp. 5-17.

[5] Saad, Y. and M. Schultz, 'Topological properties of hypercubes', IEEE Trans.
Computers, vol. 37, no. 7, Jul. 1988, pp. 867-872.

Conclusions

As stated at the outset, there are several avenues that we are interested in pursuing
further. Of these, the use of algebraic transforms and associated coding techniques
in the design of neural-net based classifiers stands out as the area where we should
focus research efforts next. We want to come up with classifiers capable of real-
time response and having low error rates. As part of this effort we need to identify,
using coding-theoretic knowledge, transforms suitable for operating environments
which have different design requirements: large number of classes, large block
length, high fault tolerance. After establishing that the proposed family of
architectures is worth pursuing, we will next study how to make them more
flexible by considering their concatenation with learning architectures to allow for
the generation of new classes, this being achieved without prohibitively increasing
the overall response time of the composite adaptive network. The last phase of the
proposed effort involves the computation of the error rates for the most promising
of the resulting classifiers. Complete simulations need to be carried out on realistic
classification problems.

Concurrent with this report we are submitting a proposal for a follow-up project
based on these considerations.

MULTILAYER TRANSFORM-BASED

NEURAL NET CLASSIFIERS

1. SUMMARY

This proposal is seeking support for the investigation of neural network classifiers
based on discrete transforms whose kernels can be identified with linear and nonlinear
algebraic binary codes. More precisely, the study will concentrate on the feasibility of
designing and implementing nonclassical classifiers that are multilayer neural nets,
where one or more layers transform points in the pattern space into points in some
appropriate "frequency" domain, and where other layers classify the latter points
through attributes of the transform used. Concatenation of the proposed networks with
exemplar classifiers (such as adaptive resonance theory classifiers) will also be
experimented with.

2. PROJECT OBJECTIVES

The main objective of the project is to produce classifiers with low error rates and real-
time response. These two attributes are viewed by researchers as the most fundamental
characteristics of classifiers with regard to the application of pattern classification to
real-world problems such as speech processing, radar signal processing, vision and
robotics. This project will explore the possibility of achieving a low error rate by viewing
patterns in "frequency" domain.

Mappings from pattern space to frequency domain that are continuous must be
devised, where continuous means that patterns that are close under the pattern space
metric will have their images close under the Hamming distance. The frequency
domain will then appear as a code space, and pattern classification might be thought of
as error correction in this space.

In the first phase of the project, the codewords will be "hard-wired" among the
interconnections of a neural network. The lower layers of this net will implement the

foregoing continuous mapping. They transform pattern vectors into noisy versions of
the codewords corresponding to the class labels. The upper layers of the net will use
coding-theoretic properties in code space to correct errors in the corrupted codewords,
thus identifying the class labels. Because of the hardwiring, the training time and the

operational response time should be relatively very short, the latter corresponding only
to the propagation delay among the layers of the neural net. Since it is not expected

that a single transform (i.e. algebraic code) will accommodate all classification
situations, the identification of suitable transforms and their matching with different
operating environments (large number of classes, high fault-tolerance, high pattern

vector length, etc.) will also be studied. The large body of knowledge about algebraic
codes and decoding procedures will be put to use. It should be noted that many

transforms used in signal processing do arise from appropriate algebraic codes; for

example, the Walsh-Hadamard transform uses the celebrated Hadamard code.
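To make the code-space view concrete, here is a small sketch of ours (not part of the proposal): class labels are assigned rows of a Hadamard matrix as codewords, and a corrupted codeword is classified by correlating it with all rows — one Walsh-Hadamard transform — and taking the largest coefficient, which for ±1 vectors amounts to nearest-codeword (minimum Hamming distance) decoding over the rows.

import numpy as np

def hadamard(m):
    """Sylvester construction of a 2^m x 2^m ±1 Hadamard matrix."""
    H = np.array([[1]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

m = 4
H = hadamard(m)                       # 16 codewords of length 16 (one class per row)
rng = np.random.default_rng(0)
label = 11
noisy = H[label].copy()
noisy[rng.choice(16, size=3, replace=False)] *= -1   # flip 3 of the 16 entries

# The Walsh-Hadamard transform correlates the input with every codeword; for ±1
# vectors the largest coefficient marks the row closest in Hamming distance.
coeffs = H @ noisy
print(int(np.argmax(coeffs)) == label)    # True: the corrupted codeword is decoded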

The next phase of the project will investigate the possibility of introducing more
flexibility into the above architectures. Clearly, for this architecture the class labels
must be known beforehand, the choice of codewords that identify the labels must be
built into the net, and learning from examples is nonexistent. By concatenating the
proposed architecture with learning architectures, a fine-tuning of the parameters of
the classifier may be performed. This approach can yield a high payoff in terms of error
rates, but of course the training time and operating response time might significantly
increase.

The last phase of the project will address the computation of the error rate for the
proposed classifier architecture in order to evaluate its place among proposed and
implemented classifiers. A difficulty that arises here is that the continuous mapping
used for going from pattern space to code space may introduce additional errors
(noise).

Throughout the project, simulations will be conducted. Realistic examples of
classification by neural nets are well known to require a large amount of memory and
interconnect computation.
It will be interesting to find out the tradeoffs that can be achieved through the
proposed architecture.

3. REFERENCES
F. J. MacWilliams & N. J. A. Sloane, "The Theory of Error-Correcting Codes", North
Holland, (1977).
S. Thomas Alexander, "Adaptive Signal Processing", Springer-Verlag, (1986).
David G. Stork, "Self-Organization, Pattern Recognition, and Adaptive Resonance
Networks", J. of Neural Network Computing, (Summer 1989).
R. P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE ASSP
Magazine, (April 1987).

MISSION

OF
ROME LABORATORY

Rome Laboratory plans and executes an interdisciplinary program in re-
search, development, test, and technology transition in support of Air
Force Command, Control, Communications and Intelligence (C3I) activities
for all Air Force platforms. It also executes selected acquisition programs
in several areas of expertise. Technical and engineering support within
areas of competence is provided to ESD Program Offices (POs) and other
ESD elements to perform effective acquisition of C3I systems. In addition,
Rome Laboratory's technology supports other AFSC Product Divisions, the
Air Force user community, and other DOD and non-DOD agencies. Rome
Laboratory maintains technical competence and research programs in areas
including, but not limited to, communications, command and control, battle
management, intelligence information processing, computational sciences
and software producibility, wide area surveillance/sensors, signal proces-
sing, solid state sciences, photonics, electromagnetic technology, super-
conductivity, and electronic reliability/maintainability and testability.
