ADVANCED SIGNAL
PROCESSING TECHNIQUES
JENS Research
Rome Laboratory
Air Force Systems Command
Griffiss Air Force Base, NY 13441-5700
This report has been reviewed by the Rome Laboratory Public Affairs Office
(PA) and is releasable to the National Technical Information Service (NTIS). At NTIS
it will be releasable to the general public, including foreign nations.
APPROVED:
TERRY W. OXFORD, Captain, USAF
Project Engineer
If your address has changed or if you wish to be removed from the Rome Laboratory
mailing list, or if the addressee is no longer employed by your organization, please
notify RL( RAp) Griffiss AFB NY 13441-5700. This will assist us in maintaining a
current mailing list.
1. AGENCY USE ONLY (Leave Blank)
2. REPORT DATE: October 1991
3. REPORT TYPE AND DATES COVERED: Final
4. TITLE AND SUBTITLE: ADVANCED SIGNAL PROCESSING TECHNIQUES
5. FUNDING NUMBERS: C - 30TN--D-0026, Task 1-9-4265; PE - 31001G; PR - 3189; TA - 03; WU - PI
6. AUTHOR(S): B. Andriamanalimanana, J. Novillo, S. Sengupta
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): JENS Research, 139 Proctor Blvd, Utica NY 13501
8. PERFORMING ORGANIZATION REPORT NUMBER: N/A
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UL
Advanced Signal Processing Techniques
Technical Summary
Our research efforts have been in the area of parallel and connectionist
implementations of signal processing functions such as filtering, transforms,
convolution and pattern classification and clustering.
Much more than is reported here was reviewed and entertained for further study
and potential implementation. For example, we considered in detail
implementation of techniques common to all filtering such as the computation of
recursive linear equations using systolic arrays and pipeline and vector processors.
Similarly, various neural architectures were studied and implemented. These
include Hopfield, Hamming, Back Propagation and Adaptive Resonance Theory
networks.
We report here those topics which we found to be most promising insofar as their
novelty and applicability to the classification of pulses are concerned, and additionally a new,
more transparent method for the computation of the discrete Fourier transform on
hypercubes. The report consists of four sections:
Algebraic Transforms and Classifiers presents a two-layer neural net classifier with
the first layer mapping from feature space to a transform or code space and the
second layer performing the decoding to identify the required class label. This
research strongly suggests the further investigation of new types of neural net
classifiers whose kernels can be identified with linear and nonlinear algebraic
binary codes.
Models of Adaptive Neural-net based Pattern Classifiers deals with design and
implementation issues at the operational level of ART-type parallel distributed
architectures in which learning is asynchronously mediated via a set of concurrent
group nodes comprising a number of elemental processors working on component
substrings of the input pattern strings. A single-slab Bland-net alternative
architecture is also proposed.
ALGEBRAIC TRANSFORMS AND CLASSIFIERS
A two-layer neural net classifier is proposed and studied. Codewords from a {1,-1}
algebraic code are 'wired' into the net to correspond to the class labels of the classification
problem studied. The first layer maps from feature space to some hypercube {1,-1}^n,
while the second layer performs algebraic decoding in {1,-1}^n to identify the codeword.
The main motivation for introducing the proposed architecture is the existence of a large
body of knowledge on algebraic codes that has not been used much in classification
problems. It is expected that classifiers with low error rates can be built through the use of
such codes.
Artificial neural nets have become the subject of intense research for their potential
contribution to the area of pattern classification [3]. The goals of neural net classifier
construction are manifold and include the building of classifiers with lower error rates than
classical (i.e., Bayesian) classifiers, which are capable of good generalization from training
data, and which have only reasonable memory and training time requirements. Most neural net
classifiers proposed in the literature adjust their parameters using some training data. In
this proposal, training has been set aside to bring out possible benefits of algebraic coding.
In subsequent work, classifier parameter adjustment will be studied in conjunction with the coding scheme.
The neural net consists of three layers of units connected in a feedforward manner.
The number of units in the input layer is the length k of each feature vector. The hidden
layer has as many units as the length of the algebraic code used. Finally the output layer
contains one unit for each class in the classification problem. The chosen codewords from
the algebraic code used are wired into the net via the interconnection weights between the
input and the hidden layers. The decoding properties of the algebraic code come into play
in the interconnection weights between the hidden and the output layers. The actual choice
for these weights will be detailed below. Note that even though the topology chosen here
resembles other popular neural net topologies, such as that of the back-propagation
network, this first version of our classifier is non-adaptive and non-learning (the pattern
classes are not learned but wired in from some a priori knowledge), and the time needed to
recognize a pattern is only the propagation delay through the two layers of weights.
Figure 1
The network operates as follows. A feature vector f to be classified is presented to the
input layer; this feature vector is a corrupted version of some class representative r (the
latter associated with the codeword c). The second layer of units outputs a vector u, the
'frequency domain' transform of the feature vector f, which is a corrupted version of c. The
output layer decodes u into the codeword c, thereby identifying the class represented by r.
The output layer does this by having all units 'off' except the one corresponding to c.
To make the net operate as required, we need to devise schemes for wiring the codewords
into the net, that is, for mapping feature space into code space (feature vectors that are close
should correspond to vectors in {1,-1}^n that are also close; this is important if the error
rate is to be kept low), decide the transformation performed by each unit, and find a way of
making the decoding of a corrupted codeword correspond to turning on exactly one unit in the
output layer. Finally, we must analyze the kind of algebraic codes that are suitable for the
proposed architecture.
First, note that the input layer of units will simply buffer the feature vector to be
classified; hence the units here perform the identity transformation. Secondly, it should be
clear that the units in the hidden and output layers must be polar (i.e., they output 1 or -1;
actually we could have made them binary but it is easier to deal with polar outputs for
neural nets). We have decided to make them threshold units, outputting +1 if the net
input is at least the threshold θ of the unit, and -1 otherwise. The value of θ will be determined
below. But first we need to make some elementary observations about {1,-1} codes.
We remark that if c and c' are two {1,-1} vectors of the same length n, then

c · c' + 2 distance(c,c') = n

where the dot is the usual dot product and distance is the Hamming distance between the
two vectors. This is so because for {1,-1} vectors, c · c' is the number of coordinates where
the two vectors agree minus the number of coordinates where they disagree; hence
c · c' = n - 2 distance(c,c'). It follows that if C is a collection of distinct {1,-1} vectors
all of length n, and if M is the maximum dot product among different codewords of C, then
the minimum Hamming distance among different vectors of C is

0.5 * (n - M)

and thus the code corrects any vector corrupted by less than

0.25 * (n - M)

errors, using the usual nearest neighbor decoding scheme. Because of this, we are
looking for {1,-1} codes where the length n is much larger than the maximum dot product M.
Next, we remark that if c is the closest codeword in the code C to a {1,-1} vector u, and if
u has been corrupted by fewer than 0.25 * (n - M) errors, then c · u > 0.5 * (n + M), while
w · u < 0.5 * (n + M) for any other codeword w of C.

This can be seen as follows. Let u ∈ {1,-1}^n and let c be the codeword in C closest to u
(note that at first sight there may be several codewords in C having minimum distance
from u; however if u has not been corrupted by 0.25 * (n - M) or more errors, then this
closest codeword is unique).

Now let u = c + e, where the error vector e has its q nonzero components equal to 2 or -2.
We have

c · u = n - 2 distance(c,u) = n - 2q

Since it is assumed that less than 0.25 * (n - M) errors have been made in corrupting c to
get u, we have

q < 0.25 * (n - M)

Hence

c · u > n - 0.5 * (n - M) = 0.5 * (n + M)

Also, for any other codeword w of C,

w · u = w · (c + e) = w · c + w · e ≤ M + 2q < M + 0.5 * (n - M) = 0.5 * (n + M)
Note that the last observation will be used to decide the threshold for the units in the
output layer, and the interconnection weights between the hidden and output layers.

Choose a suitable code C (suitable in the sense of having the length n much larger than the
maximum dot product M) with codewords

c_1, c_2, ..., c_m

one for each class, and use as the weight matrix from the hidden to the output layer the
m x n matrix whose rows are these codewords:

C_{m x n} = [ c_1 ; c_2 ; ... ; c_m ]

In other words, the weights from the hidden layer to the output layer are all 1 or -1, and
the weights from the n hidden units to the jth output unit are the components of the
jth codeword c_j.

Now if the output of the hidden layer is the vector u, then the net input to the jth output
unit is

c_j · u

If the threshold of each output unit is set to

0.5 * (n + M)

then by the observation in section 3, only the unit corresponding to the codeword closest
to u will turn on.
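As a concrete illustration of this decoding rule, the following is a minimal sketch (ours, not part of the report), assuming numpy and the codewords stored as the rows of a matrix:

import numpy as np

def decode(C, u, M):
    """C: m x n matrix of {1,-1} codewords, u: received {1,-1} vector,
    M: maximum dot product between distinct codewords."""
    n = C.shape[1]
    theta = 0.5 * (n + M)                   # threshold of every output unit
    net = C @ u                             # net input c_j . u to output unit j
    return np.where(net >= theta, 1, -1)    # at most one +1 when u is close enough to a codeword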
The thresholds of the units in the hidden layer are chosen to be 0. Now let W_{k x n} be the
weight matrix between the input and the hidden layers, and let R_{m x k} be the matrix whose
rows are the class labels (class representatives):

R_{m x k} = [ r_1 ; r_2 ; ... ; r_m ],   r_l = (r_{l1}, r_{l2}, ..., r_{lk})

Several schemes can be imagined for computing the matrix W_{k x n} from the matrices R_{m x k}
and C_{m x n}.
The first scheme is a simple correlation, defined by

W_ij = Σ_{l=1}^{m} r_{li} c_{lj}

that is, the weight from the ith input unit to the jth hidden unit is the sum of the
correlations between the ith component of a class label and the jth component of the
corresponding codeword (see also Figure 1). The idea here is a natural one: in order to achieve
continuity, we seek to strengthen the connection between an input component and a hidden unit
whenever the corresponding class-label and codeword components agree.
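In matrix form the correlation scheme is simply W = R^T C. A one-line sketch (ours), assuming numpy and the matrices R and C defined above:

import numpy as np

def correlation_weights(R, C):
    """R: m x k matrix whose rows are the class labels r_l,
    C: m x n matrix whose rows are the codewords c_l."""
    return R.T @ C          # W[i, j] = sum_l r_{li} * c_{lj}, a k x n weight matrix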
Another scheme 'simulates' a linear transformation from feature space to code space. If the
hypercube {1,-1}^n were a vector space, which it is not, and if the class labels, i.e., the rows
of the matrix R_{m x k}, were linearly independent, then there would be a unique linear
transformation mapping class label r_i onto codeword c_i, for all i with 1 ≤ i ≤ m. In other
words,

R_{m x k} · W_{k x n} = C_{m x n}

If R were square, it would be nonsingular and we could write

W = (R_{m x k})^{-1} · C_{m x n}

which would give the interconnection weights we are looking for (recall that a linear
transformation is determined by its values on a basis).

Of course, the above assumptions do not hold in most cases. We may though use some
kind of approximation of the inverse for R, called the pseudoinverse of R and denoted R⁻.
If the matrix R is of maximum rank, but not square, then R · R^T is nonsingular and the
pseudoinverse R⁻ is defined by

R⁻ = lim_{ε→0} R^T (R · R^T + ε I)^{-1}

where I denotes the identity matrix of the size of R · R^T and ε is a positive real number.
It can be shown that the pseudoinverse R⁻ always exists. Of course, in practice, we only
compute an approximation of the limit most of the time, since the exact form of R⁻ may
be difficult to obtain. For a nonsingular matrix, the ordinary inverse and the pseudoinverse
coincide.

In any case, we can take as weight matrix from the input layer to the hidden layer the
product

W_{k x n} = R⁻ · C_{m x n}
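A sketch (ours) of the pseudoinverse scheme using numpy; np.linalg.pinv computes the Moore-Penrose pseudoinverse via the singular value decomposition, which agrees with the limit expression above:

import numpy as np

def pseudoinverse_weights(R, C):
    """R: m x k class-label matrix, C: m x n codeword matrix."""
    return np.linalg.pinv(R) @ C      # k x n weights; solves R W = C in the least-squares sense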
6. Hadamard coding
It is believed that the error rate of the proposed classifier depends heavily on the number
of errors that the algebraic code wired into the net can correct, which, as we have seen, is
a linear function of (n - M), where n is the length of the code and M is the maximum dot
product. We then need to investigate {1,-1} codes for which n is significantly larger than
M.

The class of Hadamard codes seems to provide a reasonable starting point. Recall that
Hadamard matrices constitute a handy tool in many disciplines, including signal processing
and the theory of block designs. Even though Hadamard matrices exist only for certain orders,
we believe that they are still quite useful for the classifier architecture proposed.
A Hadamard matrix of order n is a square matrix H of order n, whose entries are 1 or -1,
such that

H · H^T = n I

The definition really says that the dot product of any two distinct rows of H is 0, hence if
we take as codewords all the rows of H, then M = 0, i.e., the code can tolerate any number
of errors less than 0.25n. Of course the threshold of each output unit of the classifier must
be set to 0.5n.
With -1 written simply as -, the first few Hadamard matrices are:

n = 1:  H_1 = [ 1 ]

n = 2:  H_2 =
 1  1
 1  -

n = 4:  H_4 =
 1  1  1  1
 1  -  1  -
 1  1  -  -
 1  -  -  1

n = 8:  H_8 =
 1  1  1  1  1  1  1  1
 1  -  1  -  1  -  1  -
 1  1  -  -  1  1  -  -
 1  -  -  1  1  -  -  1
 1  1  1  1  -  -  -  -
 1  -  1  -  -  1  -  1
 1  1  -  -  -  -  1  1
 1  -  -  1  -  1  1  -

Figure 2
It can easily be shown that a Hadamard matrix of order n can exist only if n = 1, n = 2, or
n ≡ 0 mod 4. It has not yet been proved that a Hadamard matrix of order n exists whenever
n ≡ 0 mod 4.

For our purposes it is enough to note that Hadamard matrices do exist for orders that are
a power of 2. This is seen through the fact that if H_n is a Hadamard matrix of order n, then

H_2n = [ H_n   H_n ]
       [ H_n  -H_n ]

is a Hadamard matrix of order 2n.
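A small sketch (ours) of this Sylvester recursion, assuming numpy:

import numpy as np

def sylvester_hadamard(g):
    """Hadamard matrix of order 2**g built by H_{2n} = [[H_n, H_n], [H_n, -H_n]]."""
    H = np.array([[1]])
    for _ in range(g):
        H = np.block([[H, H], [H, -H]])
    return H

H8 = sylvester_hadamard(3)
assert np.array_equal(H8 @ H8.T, 8 * np.eye(8))   # rows are orthogonal, so M = 0 for the row code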
7. Simplex coding
Recall the construction of the finite field GF(p^k) with p^k elements and prime subfield
GF(p) = Z_p, where p is a prime number (p = 2 is the most important case for our
purposes). An irreducible polynomial h(x) of degree k over GF(p) is chosen. The elements
of GF(p^k) are taken to be all polynomials with coefficients in GF(p) and of degree less
than k. Addition and multiplication are ordinary addition and multiplication of polynomials
but performed modulo h(x). It can easily be proved that indeed GF(p^k) satisfies all axioms
of a field and that it contains p^k elements. If, in addition, the powers of x exhaust all the
nonzero elements of GF(p^k), then h(x) is called a primitive polynomial over GF(p). Note
that databases of primitive polynomials are available in the literature.
Let h(x) = x^k + h_{k-1} x^{k-1} + ... + h_2 x^2 + h_1 x + 1 be a primitive polynomial over
GF(2). We define the pseudo-noise sequence generated by h(x) and given bits a_0, a_1, ..., a_{k-1}
by

a_i = a_{i-k} + h_1 a_{i-k+1} + h_2 a_{i-k+2} + ... + h_{k-1} a_{i-1}  (mod 2),  for i ≥ k

Such a sequence is periodic with period

n = 2^k - 1

Let C denote the binary vector (a_0, a_1, ..., a_{n-1}). Then:

(i) For any cyclic shift C' of C, C + C' is another cyclic shift of C.

(ii) If C and n are as in (i), then C has 2^{k-1} ones and 2^{k-1} - 1 zeros.

The code C⁻ consists of the vector

a_0, a_1, ..., a_{2^k - 2}

and all of its 2^k - 2 cyclic shifts.
The code C⁻ together with the all-zero vector of length 2^k - 1 is also known in the
literature as the cyclic simplex code of length 2^k - 1 and dimension k. By property (i)
above, the simplex code is a vector subspace of dimension k of GF(2)^n, where n = 2^k - 1.

From C⁻ a {1,-1} code C is constructed through the affine mapping a → 1 - 2a (i.e., replace
0 by 1 and 1 by -1).

The Hamming distance between two distinct vectors u, w of C equals the distance between
the corresponding vectors c_1, c_2 in C⁻, which, by linearity of the simplex code C⁻ ∪ {0},
equals

distance(0, c_2 - c_1)
 = distance(0, c_2 + c_1)
 = weight(c_2 + c_1)
 = 2^{k-1}   by property (ii) above.

So,

u · w = n - 2 distance(u,w) = 2^k - 1 - 2 · 2^{k-1} = -1

Hence for the code C, the maximum dot product is M = -1. C can tolerate any number
of errors less than 0.25 * (n + 1) = 2^{k-2}. The threshold of each output unit of the classifier
must be set to 0.5 * (n + M) = 2^{k-1} - 1.
For example, take k = 3, h(x) = x^3 + x + 1 (so h_1 = 1, h_2 = 0), a_0 = a_1 = 0 and a_2 = 1.
We compute

a_3 = a_0 + a_1 · h_1 + a_2 · h_2 = 0
a_4 = a_1 + a_2 · h_1 + a_3 · h_2 = 1
a_5 = a_2 + a_3 · h_1 + a_4 · h_2 = 1
a_6 = a_3 + a_4 · h_1 + a_5 · h_2 = 1

The code C⁻ and the corresponding {1,-1} code C are

0010111      1  1  -  1  -  -  -
1001011      -  1  1  -  1  -  -
1100101      -  -  1  1  -  1  -
1110010      -  -  -  1  1  -  1
0111001      1  -  -  -  1  1  -
1011100      -  1  -  -  -  1  1
0101110      1  -  1  -  -  -  1
Another example is provided by h(x) = x^4 + x + 1, a_0 = a_1 = a_2 = 0 and a_3 = 1. We
compute

a_4 a_5 a_6 a_7 a_8 a_9 a_10 a_11 a_12 a_13 a_14 = 00110101111

The code C⁻ is shown below, from which the {1,-1} code C is easily obtained.
000100110101111
100010011010111
110001001101011
111000100110101
111100010011010
011110001001101
101111000100110
010111100010011
101011110001001
110101111000100
011010111100010
001101011110001
100110101111000
010011010111100
001001101011110
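A sketch (ours) of the pseudo-noise / simplex construction above, for h(x) = x^4 + x + 1 (taps h_1 = 1, h_2 = h_3 = 0) and initial bits 0, 0, 0, 1; the check reproduces the first row of the table and the dot-product property M = -1:

import numpy as np

def pn_sequence(h, a_init):
    """h = [h1, ..., h_{k-1}] recurrence taps; a_init = [a0, ..., a_{k-1}]."""
    k = len(a_init)
    n = 2 ** k - 1                       # period of the sequence
    a = list(a_init)
    for i in range(k, n):
        a.append((a[i - k] + sum(h[j - 1] * a[i - k + j] for j in range(1, k))) % 2)
    return a

seq = pn_sequence([1, 0, 0], [0, 0, 0, 1])
assert ''.join(map(str, seq)) == '000100110101111'      # first row of the table above
rows = [seq[-i:] + seq[:-i] for i in range(len(seq))]   # all 15 cyclic shifts
C = 1 - 2 * np.array(rows)                              # map 0 -> 1, 1 -> -1
assert np.all(C @ C.T == 16 * np.eye(15) - 1)           # distinct codewords have dot product -1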
8. Future work
We need to analyze more classes of algebraic codes suitable for the proposed architecture.
One of the promising directions is the area of codes from block designs (as mentioned
earlier, the Hadamard codes actually fall under this area). Properties of these codes are
often distilled from the geometric properties of the designs. Work in this area is in
progress.
Next we should actually attempt an analytical study of the error rate of the proposed
classifier. Note that the mapping from feature space to code space may actually introduce
more errors. Even if this mapping does not introduce any error, it may still be difficult to
derive a closed-form expression for the error rate, so bounds may be the best we can do.

Finally we need to design an adaptive version of the proposed architecture, where the code
is still wired into the net but a bettering of the estimates of the parameters of the
feature-to-code mapping is obtained through training.
9. References
[1] Hopfield, J.J., Neural networks and physical systems with emergent collective
computational abilities, Proc. Nat'l. Acad. Sci. USA, v.79, pp. 2554-2558, 1982.
[2] Lippmann, Richard P., An introduction to computing with neural nets, IEEE Acoustics,
Speech and Signal Processing Magazine, pp. 4-22, April 1987.
[4] MacWilliams, F.J. and N.J.A. Sloane, The theory of error-correcting codes, 2 volumes,
North-Holland, 1977.
[5] Widrow, B. and R. Winter, Neural nets for adaptive filtering and adaptive pattern
recognition, IEEE Computer, pp. 25-39, March 1988.
An Interaction Model of
Neural-Net based Associated Memories
1. Introduction

One of the most focused topics in the entire neural network paradigm that still continues
to engage major research effort is evidently the topic of Associative or Content-Addressable
Memory (CAM) in the context of a Connectionist framework (9). An Associative Memory (AM) or a
Content-Addressable Memory is a memory unit that maps an input pattern or a feature vector
x_k ∈ R^n to an associated output vector y_k ∈ R^m in O(1) time. The input pattern x_k may be
a substring of the desired output pattern y_k, or may be totally distinct; however, in both
cases, we identify an association of the form (x_k, y_k) for each pattern index k. An AM or CAM
is a storage of all such association pairs over a given pattern domain. The input string is
also known as the cue-vector, or a stimulus, and the output string, the associated response
pattern.
Viewing it from a neural network perspective, the ensemble, even if conceptualized as
a pattern classifier/recognizer in a strict functional sense, is essentially an AM/CAM entity
if it works correctly. Because ultimately a neural-net based classifier/recognizer logically
maps an external object via an appropriate feature space onto distinct classes, we can
say that in a functional context the pairs (object → input pattern (stimulus), pattern-class (response))
comprise associative pairs. In this section we present a new model of Associated Memories
or Content-addressable Memories realizable, say, via an appropriate optical implementation
of an artificial neural net type system using lenses, masks, gratings, etc. (4,6). The issue
addressed here is not this realization task per se, but the more important issue of models
basically congruent to neural net type systems from the point of view of functionality and
performance, the models which reflect distributivity, robustness, fault-tolerance, etc. The
implementation issue has to be addressed eventually; one could aim for a neural net type
system in an electronic (as opposed to optical) environment, but that, as a topic, must wait
till one clears out the model related issues, the overall architectural framework of achieving
a certain set of objectives at the topmost level of specification.
In this paper, a matrix model of Associative memory is proposed. The strength of the
model lies in the way it is different from the traditional AM models articulated thus far
(see Hopfield (7,11), Kohonen (8,9) etc. for Associative Memory models). This model
becomes significant at high storage density of patterns when the logically neighboring pat-
terns do tend to perturb each other considerably. A memory system even locally stable in
the sense of Lyapunov at or around specific logical sites may tend to display erraticity
upon receiving another pattern for storage if the pattern storage density increases (12,3).
In such cases, instead of trying to keep patterns away from interacting with each other,
one could deliberately allow them to interact and then consider all such collective interactions
in the overall dynamics of the system. To date, no such model has been reported.

Our model is different from the proposed classical memory models in one significant
sense. It allows interactions among patterns to perturb the associated feature space. As we
pack dynamically more and more patterns into the system, the more skewed becomes the
'neighborhood' of a pattern. It is analogous to the situation where an electric field around
a charged element is perturbed by bringing in more charged particles in the vicinity. It is
this recognition in our model that provides a changed perspective.
We observe at the outset the two salient features associated with neural net type pro-
cessors or systems using such processors. First, a neural network model is ultimately a
content-addressable memory (CAM) or even an associative memory (AM) depending
upon how it is implemented. The central theme in such models is storing of information of
the relevant patterns distributively in a holistic fashion over the synaptic weights between
pairs of neuron-like processor elements allowing both storage and retrieval tasks more or
less in a fail-soft manner. Even though the time-constant of a biological neuron's switching
time is of the order of milliseconds, storage and recall of complex patterns are usually car-
ried out speedily in a parallel distributed processing mode. It is ultimately this notion of
distributed memory in contrast to local memory of conventional computers that stands out
in so far as neural net type systems are concerned.
A neural network system could also be viewed at the same time as a pattern
classifier/recognizer. Problems like Radar target classification, object identification from
received sonar signals are where neural net type systems are most adept. In particular, if a
recognition of a partially viewed object or a partially described object has to be made, if
the data is further contaminated with white noise, then not only does a neural-based classifier
appear to be more promising for real-time classification from the point of view of performance,
but a CAM or an AM oriented system might just be ideal. It depends, of course, on
how we ultimately structure the problem domain, what type of cues are allowed for
appropriate responses, etc. but it is eminently possible to lean on a specific CAM/AM
oriented neural-net paradigm to do just that. In this section, a generalized outer product
model is proposed.
The associative-memory elements ought to yield a performance close to O(1) search
time. However, once pattern-pattern interactions are allowed within the model one cannot
guarantee O(1) performance. In our model, it is feasible that on some cue patterns the
problem is so straightforward that one need not consider pattern-pattern interactions at
all. In such cases, it behaves as a classical AM element with O(1) search-time.
Consider an orthonormal basis {Ok} spanning an L-dimensional Euclidean vector
space in which associated pairs of patterns (cue, response) in the form (Xk, Yk) are stored
in correlation-matrix format of association. The whole idea is to buffer the stored patterns
from the noisy cue vectors via this basis such that one could see the logical association
between x and y via a third entity, our basis. Note that a cue vector x_k in this basis is
expressed as

|x_k> = Σ_i a_ki |φ_i>

such that the dot product between two such vectors |x_k> and |x_l> is

<x_k | x_l> = Σ_i a_ki a_li = α_kl
Note that the cue-vectors themselves, in this model, need not constitute an orthogonal vec-
tor space. Even though this is a desirable assumption to consider, in reality, it may not be
a good assumption to depend on. Secondly, the vector space {φ_i} need not be complete; an
appropriate subspace would do, as indicated by Oja and Kohonen (13). They build the pattern
space over some p-dimensional subspace with p ≤ L and deal with a pattern by having
its projection on this subspace and extrapolating its residual as an orthogonal projection
on this space. We could also extend our model in this way.

We could interpret the coefficients a_ki associated with a cue-feature as follows. The
normalized coefficient a_ki / sqrt(Σ_j a_kj²) could be regarded in this model as the probability
amplitude that the vector |x_k> has a nonzero projection on the elemental basis vector |φ_i>.
The squares of such terms are the associated probabilities; if we want to picture it, a cue vector
in this representation would be a sequence of spikes on the φ-domain whose height at any
specific φ-point is the probability amplitude that the vector there has a nonzero projection.
Note that the sum of the squares of all these projections must necessarily equal 1.0 for a
nonzero cue-vector.
Given (cue,response) pairs we form matrix associative memory operators as follows:

M_s = Σ_k |φ_k><x_k|

M_r = Σ_k |y_k><φ_k|
such that the projection of M_s|x_l> on any arbitrary |φ_q> is

<φ_q | M_s | x_l> = Σ_i a_qi a_li = α_ql

We call this the figure of merit of a cue x_l on a basis vector φ_q, fm(x_l | φ_q).
Obviously, the memory operator has the largest projection when q = l, i.e.,

res(y_l) = fm(x_l | φ_l)
In the simplest case, one assumes res(.) on associated pattern vectors to be very large, or,
equivalently, one assumes that the storage of multiple (cue,response) pairs is facilitated by
the requirement that the Hamming distance between any two such cues in the
library is relatively large. In such cases, after receiving the index value l associated with
the given cue |x_l>, one continues by operating M_r on |φ_l>. Then

M_r |φ_l> = Σ_k |y_k><φ_k|φ_l> = |y_l>

Suppose, however, that the received cue is a distorted version |x> = |x_l> + |δ> of a stored
cue. Then

<φ_m | M_s | x> = α_ml + <x_m | δ>

The matrix elements show deterioration of performance. Since the cue-vectors need not, in
general, form an orthonormal basis, the first term on the right hand side of the matrix element
would, in general, contain the usual cross-talk component, but now, in addition to
this, the distortion δ, even though orthogonal to the most likely cue-vector x_l, would intro-
duce additional degradation.
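A small numerical sketch (ours) of the outer-product operators as read above, assuming the basis vectors are the rows of an orthonormal matrix Phi; all names are ours:

import numpy as np

def build_memories(X, Y, Phi):
    """X: cues as rows (p x L), Y: responses as rows (p x m), Phi: orthonormal basis rows (p x L)."""
    Ms = sum(np.outer(phi, x) for phi, x in zip(Phi, X))   # Ms = sum_k |phi_k><x_k|
    Mr = sum(np.outer(y, phi) for y, phi in zip(Y, Phi))   # Mr = sum_k |y_k><phi_k|
    return Ms, Mr

def recall(Ms, Mr, Phi, x):
    fm = Phi @ (Ms @ x)          # figure of merit <phi_q|Ms|x> for every q
    q = int(np.argmax(fm))       # basis vector with the largest projection
    return Mr @ Phi[q]           # the associated response y_q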
The question of optimization of the memory space is now the issue. Given the possibility
of receiving distorted signals at input ports whence one obtains the cue-vector in
question, one may approach the design problem in a number of ways.

One simple approach, particularly when storage density is sparse, is to consider the
possibility that the received cue x is close to one of the cues considered acceptable. Under
this assumption, we form both the on- and the off-diagonal terms <φ_q|M_s|x> and note the
index q for which the matrix element is largest. We then logically associate x with y_q, on
the assumption that y_q is what is stored logically at the basis φ_q. The task of retrieving
the associated pattern given a cue is then a straightforward problem as long as a relatively
sparse set of (cue,response) pairs is associatively stored. The problem occurs when
cues are close to each other or when one finds that a single level discriminant may not suffice.
If an unknown cue x is not stored precisely in the format it has been received, and it is
equally close to x_q as it is to x_r, with q ≠ r, we must have further information to break the
impasse.
One simple strategy is to extend the proposed abstract AM model as follows. In this
case, we do not store just single-instance (cue, response) items but, instead, a class of
items. In other words, we prepare a class-associative storage (or a class-content-addressable
storage) in the form

C_s = Σ_k |b_k><φ_k|

where |b_k> encodes the class of items stored at the kth logical site.
Though the essential structure of the tuple (cue_1, cue_2, ..., response) is simple enough, it
could be exploited in a number of ways allowing us to extend the scope of simple
CAM/AM based storage/retrieval task either through a spatial or a temporal extension, or
a combination of both.
In the sequel, we conjecture some simple schemes.
A. Spatial Extension.
1. Concurrently, obtain n input signals {x_i, 1 ≤ i ≤ n}. These are n different
instances of the same measure or different measures of the same feature element. The
underlying hypothesis is that all these x_i measures point to essentially the same
response vector y.

2. For each x_i, obtain the associated response y_i from the memory.

3. At the next cycle, at the next higher level, obtain η(y_i) for every occurrence of y_i;
η(.) yields the number of occurrences of y_i.

In this case a concurrent evaluation is proposed after the input stage. At the next layer, a
decision logic gate at the output port yields the optimum desired response.
Obviously, this simple design is robust, distributed and modular. It also provides mul-
tiple modular redundancy. In the next proposed extension, we introduce a model which is
essentially extensive in the temporal domain.
B. Temporal Extension

1. i ← 1

2. At cycle i, obtain (x_i, y_i)

3. While res(y_i) < δ do

3.1 i ← i + 1

3.2 obtain at the ith cycle

(x_i, y_i) | res(y_{i-1}) < δ, res(y_{i-2}) < δ, ..., res(y_1) < δ

3.3 compute res(y_i)

Here the crucial element is the retrieval process specified in 3.2. At the ith cycle, the
scheduled retrieval process is in WAIT state until it is fired by the appropriate event at
the (i-1)th cycle, namely the arrival of the condition res(y_{i-1}) < δ. The two schemes are
outlined below.
(Figure: the spatial-extension and temporal-extension schemes.)

In both schemes, the association weights are interpreted as follows:

• Logically, the cue-vector |x_k> is associated with the basis vector |φ_k>, and |x_l> with
|φ_l>, with a joint probability a_kl.

• However, the closer the cue vectors |x_k> and |x_l> are, the more likely is the error one
is apt to make in this association. In view of probable misclassification, the weight on
the initial proposition must therefore be reduced. This we do by considering the
counter proposal that one could have |x_k> associated with |φ_l> and |x_l> with |φ_k>,
respectively, with a probability b_kl instead. The higher our conviction in this
regard (higher b_kl values), the lower becomes the strength of our original hypothesis.
Given an arbitrary cue vector x, let us assume that res(y | x) < δ, and that within a
threshold limit the cue-vector x is similar to other cue-vectors whose logical associations
are, however, different. That is, we suppose an equivalence class associated with x.
Given this, we would like to know how best one infers the corresponding associated vector
y with x. Let us denote by (m, n, *, q) the matrix element formed with x and |x_q>, where
x_q is logically a priori associated with some y_q via the function φ_q. Under the assertion
that x is associated to some y_m via the function φ_m, the matrix element
(m, n, *, q) becomes a measure of the strength of this hypothesis. Therefore, to obtain the
optimum association, we compute

max over m, n, q of (m, n, *, q)

and obtain the corresponding φ_m, whence we obtain the most logical association y via

M_r |φ_m> = y

One could extend this notion of logical substitutability and reorganize the association space
accordingly.
This is carried out next in our three-point memory system model as indicated below.
Note that the right hand side is summed over all permutations. The matrix elements are
then

<φ_r(1) φ_s(2) φ_t(3) | M³ | x_k(1) x_l(2) x_m(3)> = Σ_P (-1)^P a_rk a_sl a_tm

where (r,s,t,k,l,m) is the matrix element in question. Note that the substitutability
rationale could be advanced in the same way as we articulated it earlier in the two-point
memory system. If the string x_k is found similar to x_l and x_m, then the strength of the
hypothesis concerned with the a priori association of x_k with φ_k ought to be adjusted in
view of the likelihood of false memory assignments via φ_k, φ_l, and φ_m. In our model, we
propose that with respect to the original configuration φ_k(1) φ_l(2) φ_m(3) every other
assignment of the form φ_{k1}(1) φ_{k2}(2) φ_{k3}(3), where k_1, k_2, k_3 is some permutation of the
indices k, l, m, is either an inhibitory memory association (reduces the strength of our initial
hypothesis) or a contributory memory association (advances the strength of our initial
hypothesis) depending on the number of transpositions required to transform {k,l,m} to
{k_1, k_2, k_3}. That is, if there are q such transpositions required to bring the string {k,l,m}
into {k_1, k_2, k_3} then the parity of the new memory assignment is (-1)^q, as though every
single transposition is tantamount to an adjustment contrary to the direction of the original
assignment.
We consider the following diagram to illustrate the point that our model suggests.
(Figure: patterns P, Q, R, S, T stored at consecutive local minima of an energy 'hill'.)
Assume that on the energy 'hill', at the consecutive local minima, the distinct patterns
are stored as shown by the points P, Q, R, etc. Suppose, also, that these patterns
are so close to each other that sometimes instead of retrieving the pattern Q from the
storage we recall some pattern R. It is pointless to suggest that we sharpen our pattern
specification more tightly with a view to minimizing pattern misclassification to some
tolerable level. Accordingly, we approach the problem in the following way. Indeed, on
some cue, let us assume, it appears that the most probable recall is, say, pattern Q. However,
what if we had the patterns stored in the order, say, R Q P S T ... or P Q R T S ...,
instead of the indicated order of storage on the hill P Q R S T ...? Would that make any
difference though? Would we still return the pattern R when we were supposed to return
Q, instead? Surely, these alternative orders are also probable with some nonzero probabilities
simply because the patterns (R and P) in the first suggested list, and the patterns (T
and S) in the second list, are conceivably interchangeable to the extent these few similar
patterns are concerned. But, then, why restrict ourselves to single-transposition lists?
Surely, if P Q R S T ... is a feasible list, then the list obtained by double transpositions
such as (P R)(P T){P Q R S T ...} = {T Q P S R ...} is also probable. However, we
make a somewhat qualified claim now. If the storage order in the first approximation on
the 'hill" seems to be the list P Q R S T ... coming down the hill, then a storage order
implying a p-transposition on the original order ought to be less probable than the one
with a q-transposition, if q < p. In other words, in our scheme of approach, the one point
memory model is the most optimum first-order AM/CAM model we could think of. Its
improvement implies a model in which interactions among patterns cannot be ignored any
more. The two-point memory system, and then the three-point memory systems, are then
the second and the third-order approximations, as indicated, with more and more pattern-
interactions taken into account.
Therefore, one could suggest a plausible model in which we carry out storage and
retrieval of patterns as follows. We assume that AM/CAM memory need not always
remain static. We, instead, suggest that such a memory should be dynamically restructured
based on how successful one continues to be with the process of recall. We assume
the memory system to be in state γ, that is, its current memory is so densely packed that
at time t it is a γ-point memory at some level p (p is an integer less than or equal to γ) as
indicated below

M_γ^p = Σ_{k,l,m,...,q} Σ_P (-1)^P a_{klm...q} |φ_k(1) φ_l(2) φ_m(3) ... φ_q(p)>
        <x_k(1) x_l(2) x_m(3) ... x_q(p)|
If recall of similar patterns, in this memory, gives us a hit ratio less than some threshold
parameter ρ, then we should restructure the memory as an M_γ^{p+1} element, and continue
using it. This, probably, is what does tend to happen in using biological memory.
Whenever we tend to face difficulties in recalling stored information (with a presumed
similarity with other information strings), perhaps even consciously at times, we restructure
our memory elements focusing on other nontrivial features of the patterns not considered
before.
Accordingly, one could approach the storage/retrieval policy as follows. Assume that
the system is at state γ, i.e., its memory is at most M_γ^γ. Given an incoming pattern x, the
system tries to return a vector y using the simplest memory M_γ^1, in which interactions
among the stored patterns are ignored. If, within some threshold parameter δ², it finds
that it could, instead, return any one of the vectors in the set S² = {z_1, ..., z_r}, then it
considers its memory to be logically the M_γ² and attempts to return y with less uncertainty.
If, however, it finds that, within some threshold parameter δ³, it could, instead,
return any one of the vectors in the set S³ = {z_1, z_2, ..., z_r}, then it considers its memory
organized as M_γ³ and continues with the process. We assume that, while the system is at
state γ, the memory could be allowed to evolve sequentially at most to its highest level
M_γ^γ.

What if the system fails to resolve an input cue vector even at the memory level M_γ^γ?
Then, we could reject that pattern as irresolvable and continue with a fresh input. The
system could be designed to migrate from the state γ to the state γ+1 if the number of
rejects in the irresolvable class mounts up rapidly. Otherwise, we leave it at the state γ,
and on each fresh cue x we let the system work through M_γ^1 → M_γ^2 → ... → M_γ^γ as
need be.
References
10. Lee, Y. C. et al., 'Machine Learning using a Higher Order Correlation Network',
Physica, 22D, pp. 276-306, 1986.

11. McEliece, R. J. et al., 'The Capacity of the Hopfield Associative Memory', IEEE
Trans. Information Theory, IT-33, pp. 461-482, July 1987.

12. Sussmann, H. J., 'On the Number of Memories that can be Perfectly Stored in
a Neural Net with Hebb Weights', IEEE Trans. Information Theory, IT-35, pp.
174-178, January 1989.
Models of Adaptive Neural-net based
Pattern Classifiers/Recognizers
• A pattern x_i = (x_1i, x_2i, ..., x_ni)^T is in class c_A if the corresponding class-exemplar
x̄_A is near to x_i within a tolerance distance (or an uncertainty) of δ_A, i.e.

d(x_i, x̄_A) ≤ δ_A
where d(.,.) is some suitable distance measure on the metric spanned by the pattern vec-
tors.
In an unsupervised environment, the emerging class-exemplars or the cluster prototypes
in a multiclass domain could be made to obey some additional constraints. Assuming
that we do not a priori know their distributions except the requirement that their mutual
separation must be at least as large as some threshold parameter, one could require that

d(x̄_A, x̄_B) ≥ ρ_min

The idea is that at some distance or lower a pair of clusters may lose their individual
distinctiveness so much so that one could merge the two to form a single cluster. Precisely
how ρ_min ought to be stated is a debatable issue, but it could be made related to the
inter-cluster distance (ICD) distribution, which we may conjecture to be a normal distribution
with a mean μ and a variance σ². Accordingly, we could let

ρ_min = μ + β σ

where σ² is the variance of the ICD-distribution N(μ, σ²), assumed to be small, and β is a
suitable constant. The problem is that we usually do not know this distribution a priori.
For an unknown pattern domain, it may not even be possible to predict a priori into how many
distinct pattern classes the pattern space should be partitioned. All we know is that the more
separated the equivalence classes are, the more accurate one would be in ascertaining that a
specific pattern belongs to a specific class.
Ideally, clusters ought to be well separated on the feature plane where the patterns
are recognized. This is ensured by a high value of the mean inter-cluster distance μ. This
could be achieved by designing a feature space in which small differences at individual
pattern levels could be captured. This is a design issue, an encoding problem which could be
separately tackled to an arbitrary degree of performance. This issue would be addressed in
our subsequent papers. However, there is another problem, which we should look into. This
arises in situations where the order of input matters. In a functional system, we do want
the pattern-classes to be compact, i.e., we want the intra-cluster distance measures to be as
small as possible so that we could have crisp clusters and retain them as such. This means
that the variance measures on the exemplars or the average coefficient of variation of an
exemplar ought to be small. But then, is this, or should it be ever under our control? The
pattern-sampling that takes place during the training period may include event instances so
noisy and incomplete that for all practical purposes they could be considered garbage.
trained with such feature vectors? One could suggest that if on inclusion of such a feature
vector the class prototype is perturbed more than one could tolerate then one could aban-
don the vector altogether. However, this means that to be reasonable one should make the
level of tolerance function of number of patterns already accommodated within the class.
In this section, we conjecture some models which are obvious extrapolations of some
neural-net based models already advocated. One could consider a known classifier model
and expand it to suit a specific architecture, in which case one starts with the advantage
that at least the results up to the preextrapolated modeling state are known and could be
considered as a reference basis for further studies. However, we have some problems in this
regard. It is possible that
* without an extensive simulation and actual real case studies, one may not be in
a position to confidently use these extrapolated models, even though, they do
appear fairly plausible at this stage,
* no neural net model has been so extensively studied to understand the extent
to which the model is applicable under realistic noise and measurement oriented
issues.
(c) Each PUM, working independently from each other, attempts to identify
the best exemplar its input pattern corresponds to. We assume that
there are μ + 1 distinct classes to choose from. One of these, the refuse
class, would correspond to those patterns which are too fuzzy to belong to
any of the other μ classes. The exemplar weights are stored in the b-matrix.
The dot products of its input pattern with the exemplar patterns
are computed at this level as

A_lj = Σ_i b_lji a_li

where the index i is over all neuron-type processors comprising the lth
PUM, and the index j points to a specific class or pattern exemplar.

(d) The tentative cluster I(j') for the lth PUM conglomerate is obtained
as

I(j') = index (max_j {A_lj})

If A_lj' is less than δ' then the pattern goes to the refuse class.
(e) Each PUM then computes the match of its input with the top-down template of the
tentatively chosen exemplar,

ā_l = Σ_i t_I(j')i o_i
(g) We next compute the frequency of occurrence m(j*) of the exemplar j'
as the indicated pattern choice at the subpattern levels, and obtain

θ(j') = m(I(j')) Σ_l [(ā_l - ρ_l) / ρ_l] Thr(ā_l - ρ_l)

where Thr(x) = 1 if x > 0.0, otherwise 0.
(h) The winning exemplar over all PUMs is then

j* = index (max_j θ(j))
(i) Synaptic weights (both the top-down and the bottom-up weights) are
updated if θ(j*) > γ, some threshold parameter. Then, referring to time
instances by the label l, we have

t_j*i(l+1) = t_j*i(l) a_i

b_j*i(l+1) = t_j*i(l) a_i / (0.5 + Σ_i t_j*i(l) a_i)
(j) If the weights are updated via (i), then on all PUM nodes l with I(j*)
as its choice, we update

ρ_l(l+1) = max_l {ρ_l(l)}
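The garbled update equations in step (i) appear to follow the standard ART1 fast-learning rule; a compact sketch under that reading (the variable names and the binary-pattern assumption are ours, not the report's):

import numpy as np

def update_weights(t, b, j_star, a, L=0.5):
    """t: top-down templates (classes x n, float 0/1), b: bottom-up weights,
    j_star: winning class index, a: binary input pattern (0/1)."""
    t[j_star] = t[j_star] * a                        # t_{j*i} <- t_{j*i} * a_i
    b[j_star] = t[j_star] / (L + t[j_star].sum())    # b_{j*i} <- t_{j*i} / (0.5 + sum_i t_{j*i})
    return t, b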
A Bland-net Classifier:
F(t) = {a_t1, a_t2, ..., a_tn}

(b) Assign random weights to the synaptic elements b_ji. Note that the elements
b_ji ∈ [0,1].
A Bland-net System
(e) Update synaptic weights bj'i if the last pattern is included in this class
as
Classifier Architectures
Training a system to the eventual task of correct pattern-class identification on a pat-
tern space may be order-sensitive in some cases. This, of course, depends upon the algo-
rithm used to identify clusters, and the way the training patterns are introduced at the
input and processed by the system subsequently. If the system at the task level is essen-
tially sequential in the sense that the individual patterns are sequentially admitted and
processed one at a time against the already acquired knowledge, tentative though it may be,
of likely class-distribution, which, in turn, is expected to be incrementally adjusted given
the current new input, we could get an order sensitive distribution eventually. Ideally, the
input patterns, in their entirety, should be concurrently processed as one single whole without
any prior reference to any distribution, so that after a given number of cycles, we
would obtain a stable, crisp, unbrittle distribution of the sampling population significantly
close to the population distribution in the universe of discourse. But, if the patterns arriv-
ing at an input port are only sequentially admitted as units of a single time-dependent or a
time-varying input stream, we cannot arbitrarily keep them on hold in some temporary
buffers to be released at some convenient time for concurrent processing in bulk, unless the
input rate is very high compared to processing rate.
To understand the overall dimension of the problem in this regard, we consider some
simple processing models. Suppose the patterns arrive at a rate of λ_p patterns/sec at an
input port obeying a Poisson distribution as follows. The probability that the number of
arriving input patterns during an arbitrary interval of length t units is n is given by

p(number of inputs = n within t) = e^{-λ_p t} (λ_p t)^n / n!

Suppose, in the first model, we have all the incoming patterns temporarily queued in a
buffer (infinitely long) till they are serviced one at a time in a FIFO discipline by an
appropriate classifier (using, say, the ART1 algorithm of Carpenter & Grossberg (3,5)) in
which the average individual pattern processing time is 1/μ sec. We assume that the
classification process is exponentially distributed, that is, the probability density of the
classification/recognition time is given by

f(s) = μ e^{-μs},  s ≥ 0
     = 0,  otherwise
In this case, the relevant system and pattern traffic aggregate measures emerge as follows
(2). With ρ = λ_p/μ,

L⁰, the average buffer occupancy, = ρ / (1 - ρ)
W⁰, the average residence time per pattern, = 1 / (μ - λ_p)
W_q⁰, the average waiting time in the queue, = ρ / (μ - λ_p)

We should strive for a system in which L⁰, W⁰, and W_q⁰ are all small.
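These are the standard M/M/1 quantities; a small sketch (ours) for experimenting with arrival and service rates, using made-up illustration values:

def mm1(lam, mu):
    """M/M/1 measures for arrival rate lam and service rate mu, lam < mu."""
    rho = lam / mu
    L  = rho / (1 - rho)      # average buffer occupancy
    W  = 1 / (mu - lam)       # average residence time per pattern
    Wq = rho / (mu - lam)     # average waiting time before classification starts
    return L, W, Wq

print(mm1(8.0, 10.0))         # e.g. 8 patterns/sec arriving, 10/sec classified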
We could improve the system somewhat if we process the patterns concurrently by a
set of multiple classifiers. We assume that they are all equal in performance, i.e., capable of
classifying a single pattern in 1/μ time on the average. There are at least two distinct
ways to process the patterns now.

We assume that the system, in this case, comprises k identical processors in a processor
pool to process incoming patterns in FIFO mode service. In this case, all the incoming
patterns wait in a single queue served by k identical processors in the M/M/k service
mode. Whenever one of the k machines becomes idle, it picks up the first pattern it finds
at the input queue and attempts to classify it. The system organization is indicated below.

(Figure: a processor pool based classifier model.)

Given that the arrival rate of the patterns into the system is still λ_p patterns/sec,
the service rate at any of the k given classifiers is also Poisson with a mean rate of μ,
and λ_p < kμ, the equilibrium probability P_0 that the system is idle is given by

P_0 = [ Σ_{i=0}^{k-1} (λ_p/μ)^i / i!  +  (λ_p/μ)^k / (k! (1 - λ_p/(kμ))) ]^{-1}

and the equilibrium probability P_i that the system has i patterns in it is given by

P_i = P_0 (λ_p/μ)^i / i!             for i ≤ k
P_i = P_0 (λ_p/μ)^i / (k! k^{i-k})   for i > k

Moreover,

L_q = P_0 (λ_p/μ)^k (λ_p/(kμ)) / ( k! [1 - λ_p/(kμ)]² )
where, Lq is the average number of patterns in the processing queue waiting for one of the
k processors to process them, and,
W^A = L_q / λ_p + 1/μ,  and

L^A = λ_p W^A

where W^A is the average residence time per pattern in the system, and L^A respectively is
the system occupancy rate in the model A.
In this case, the input stream is split into k streams, each of which is then looked after
by a single processor with a processing power, as before, of completing on the
average one pattern in every 1/μ second. This is schematically as shown below.

(Figure: a split-stream classifier model with k independent queues.)

At each arriving pattern stream, we assume, the interarrival time between any two
consecutive patterns, on the average, would be exponentially distributed with a mean of
k/λ_p, since upon being available for processing a pattern would be sent
down to one of the processor queues with a probability of 1/k. In this case, the system
aggregates are as follows:

L^B = λ_p / (kμ - λ_p)

W^B = k / (kμ - λ_p)
In either case, whether we provide a bank of multiple servers for a single queue pattern
traffic or a multiple port service with each input port being serviced by a single processor,
the net result is the reduction of residence times, since both W^A/W⁰ and W^B/W⁰ are less
than 1, for k > 1.
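A sketch (ours) comparing the two organizations numerically: a shared queue feeding k classifiers (model A) against k split streams (model B). The arrival and service rates below are made-up illustration values:

from math import factorial

def mmk_residence(lam, mu, k):
    """Average residence time W^A for an M/M/k queue, lam < k*mu."""
    r, rho = lam / mu, lam / (k * mu)
    p0 = 1.0 / (sum(r**i / factorial(i) for i in range(k))
                + r**k / (factorial(k) * (1 - rho)))
    lq = p0 * r**k * rho / (factorial(k) * (1 - rho)**2)
    return lq / lam + 1 / mu

def split_residence(lam, mu, k):
    """Average residence time W^B when the stream is split k ways."""
    return k / (k * mu - lam)

lam, mu, k = 8.0, 3.0, 4
print(mmk_residence(lam, mu, k), split_residence(lam, mu, k))  # the shared pool gives the smaller W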
Here, we assume the system to be in one of the two mutually exclusive states. We
assume that it is either at a learning state A whereupon it trains itself incrementally with a
stream of incoming feature vectors on the assumption that these vectors do constitute a
set of legitimate patterns from the population of patterns in a statistical sense, or it is in
state B where its learning has stabilized to the extent of knowing precisely in how many
pattern classes the pattern domain is to be partitioned, and also, to a large extent, what the
corresponding pattern exemplars look like.
Assuming the system to be in state A, that is in the learning state, we let the incom-
ing feature vectors all collated in a time ordered sequence at the input buffer B serviced by
a controller. The controller, periodically, picks up k feature vectors from the buffer and
gets them shuffled at a Permutation block P such that the input string of vectors {x̄(t),
x̄(t+1), ..., x̄(t+k-1)} is changed to a random sequence of vectors of the form {x̄(t_k), x̄(t_l),
..., x̄(t_m)} at the output of P. The controller then dispatches these vectors to the classifiers
at a rate of one vector per classifier per dispatch event. The classifier bank comprises k
independent classifying units. Each classifier C_i upon receiving the vector x̄(t_i) attempts to
learn from it using one of the algorithms outlined. After learning/classifying m consecutive
patterns, each classifier Ci sends its exemplar profile map to its output node Oi where the
output exemplar-profile corresponds to the pattern-class distribution for the last m
independent feature vector events. The output profile at Oi is a string of vectors along
with their pattern counts i.e. the number of patterns it has accommodated in a given class,
as shown below:

(x̄_1, n_1), (x̄_2, n_2), ..., (x̄_h, n_h)

indicating that at the end of m consecutive pattern processings the plausible partition of
the feature-space is a h-tuple as shown above. The local output node Oi sends this feature
map to the next level up at the node Y where all such k independent feature maps arising
from below are consolidated with the consolidated feature map obtained at the last cycle.
If, on the other hand, the system is at state B, that is in the state of functioning as a
pattern recognizer, the controller sends the input pattern to be recognized directly to the
processor Y in an M/M/c processing mode assuming the feature vector arrival process at
the input buffer is Poisson and the pattern recognition time at Y is exponentially distri-
buted, respectively. The service times at the processors Ci are all assumed to be exponen-
tially distributed.
We assume that the switching over from the state A to state B takes place once the
system is triggered by one or more of a set of threshold conditions; once so triggered,
the system migrates into the state B.
Note that at any sampling event time during or after training episodes, after some minimal
time point t_0, the node Y has a specific exemplar profile. Even if one or more classifier
type processors fault, the system is globally fault-tolerant. The basic structure is indicated
below.
Note that each elemental concurrent processor Ci could be, at some level, a feedback
processor, as shown in the diagram, in the event the pattern-class corresponding to an
input pattern requires further processing.
(Figure: the classifier bank C_1, ..., C_k with local output nodes O_i feeding the consolidation node Y.)

Ψ_Y(l) = Ψ_Y(l-1) ∪ ( ∪_{i=1}^{k} ψ_i(l) )

and, given two class-distribution profiles, classes from the two profiles are coalesced whenever
the distance between their exemplar patterns is small enough, where d(.,.) is the 'distance'
between the exemplar patterns in the argument. One could, if necessary, condition the union
operation further by requiring that two classes may not be allowed to coalesce if the resultant
class were to emerge as too big a class by size.
Performance Profile
The above model, as proposed, attempts to provide a 'bias-free' architecture via concurrent
multistream computation of the global class-distribution. If the patterns, or rather the
feature vectors, arrive into the system during the training period at a Poisson rate of
λ_p vectors/sec, each processing stream then receives input at a rate of λ_p/k vectors/sec,
which gets processed with an average residence time of W sec per pattern, where

W = k / (kμ - λ_p)
Assuming that with a probability θ a task returns to the end of the buffer for at least
one more round of processing, and with a probability of (1 - θ) it departs the classifier, one
obtains, at equilibrium,

(λ_p/k + (1 - θ)μ) p_n = (λ_p/k) p_{n-1} + (1 - θ)μ p_{n+1},  n ≥ 1

where p_n is the equilibrium or the steady-state probability that the subsystem at the C_i
processor stream has n tasks in it (including one being processed, if any). The equilibrium
population densities come out to be

p_n = (λ_p / (k(1 - θ)μ))^n (1 - λ_p / (k(1 - θ)μ)),  n ≥ 0

L_i = Σ_{n=0}^{∞} n p_n = λ_p / (k(1 - θ)μ - λ_p)

while the expected response time of a task in this stream, W_i, comes out to be

W_i = L_i k / λ_p
(Figure: A Multidepth Pipeline Classifier Processor.)
level c, and assuming that, for all practical purposes, the system is decomposable as multiple
M/M/c subsystems, working in tandem, we have the effective pattern arrival rate in
each subsystem at the depth c as, at equilibrium,

λ_c = λ_p / (k (1 - θ)^c)

in terms of which, the expected residence time of a pattern task at each pipeline section
comes out to be

W_c = 1 / (μ - λ_c)
Note that an infinite variety of parallel distributed architectures for neural-based network
systems could be proposed. In almost all cases, multiple neuron-type processors are
best suited to provide SIMD type architectures at some computational level. Organization
of machines based on SIMD type primitives as indicated above yields substantial performance
benefits.
References:
DISCRETE FOURIER TRANSFORMS ON HYPERCUBES

This paper presents a decomposition of the discrete Fourier transform matrix that is more
explicit than what is found in the literature. It is shown that such a decomposition forms
an important tool in mapping the transform computation onto a parallel architecture such
as that provided by the hypercube topology. A basic scheme for power-of-a-prime lengths
is presented. The use of the Chinese Remainder Theorem in pipelining the computation of the
transforms, for lengths that are not a power of a prime, is also discussed.

The discrete Fourier transform of a vector X of length n is the vector Y defined by

Y = F X

where

F = [ w^{kj} ],  0 ≤ k, j ≤ n - 1

and w is a primitive nth root of unity (w = e^{-2πi/n}), with i² = -1.
It is claimed that if n = 2^γ, then the DFT matrix F is, up to a permutation matrix, the
product of the γ sparse block-diagonal matrices F_{γ-1}, F_{γ-2}, ..., F_1, F_0, where

(i) each block F_{γ-i,α} is square of order 2^{(γ-i)+1} = n/2^{i-1} and is of the form

F_{γ-i,α} = [ I    w^r I ]
            [ I   -w^r I ]

and there are exactly 2^{i-1} blocks; I is the identity matrix of order n/2^i;

(ii) for each block, r is the nonnegative integer whose γ-bit binary expansion is the reversal
of that of 2α.
We write the components of Y and X as y[k] and x[j], for 0 ≤ k, j ≤ n - 1, so that

y[k] = Σ_{j=0}^{n-1} x[j] w^{kj}

We then write the indices k and j in binary, using γ bits of course, i.e.,

k = k_{γ-1} k_{γ-2} k_{γ-3} ... k_2 k_1 k_0
j = j_{γ-1} j_{γ-2} j_{γ-3} ... j_2 j_1 j_0

In decimal,

k = 2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0
j = 2^{γ-1} j_{γ-1} + 2^{γ-2} j_{γ-2} + ... + 2 j_1 + j_0
y[k_{γ-1} ... k_1 k_0] = Σ_{j_0=0}^{1} ( Σ_{j_1=0}^{1} ( ... ( Σ_{j_{γ-1}=0}^{1} x[j_{γ-1} ... j_1 j_0] w^{kj} ) ... ) )

where the factor w^{kj} splits as a product of factors of the form

w^{(2^{γ-1} k_{γ-1} + 2^{γ-2} k_{γ-2} + ... + 2 k_1 + k_0) 2^{γ-i} j_{γ-i}},  i = 1, ..., γ

Since w^n = 1, these reduce to

w^{2^{γ-1} k_0 j_{γ-1}},  w^{2^{γ-2} (2 k_1 + k_0) j_{γ-2}},  ...,  w^{(2^{γ-1} k_{γ-1} + ... + k_0) j_0}
The computation can then be carried out in stages, where at each stage we calculate an
intermediate vector X_{γ-i}.

Note that in our notation, the first vector to be computed is X_{γ-1} and the last is X_0.
From the reduction of w^{kj} above, we see that the components of the vectors X_{γ-i} and
X_{γ-i+1} (with X_γ = X) are related by the equations

X_{γ-i}[k_0 k_1 ... k_{i-1} j_{γ-i-1} ... j_1 j_0]
  = X_{γ-i+1}[k_0 ... k_{i-2} 0 j_{γ-i-1} ... j_0]
    + w^{2^{γ-i}(2^{i-1} k_{i-1} + 2^{i-2} k_{i-2} + ... + 2 k_1 + k_0)} X_{γ-i+1}[k_0 ... k_{i-2} 1 j_{γ-i-1} ... j_0]

i.e., we may write

X_{γ-i} = F_{γ-i} X_{γ-i+1}
From the last expression above, we see that the matrix F_{γ-i} can be blocked into 2^{i-1}
submatrices. Each of the submatrices is determined by a unique value of the bits

k_0 k_1 ... k_{i-3} k_{i-2}

We use these bits as our label α for each block, so the decimal value of α is

α = 2^{i-2} k_0 + 2^{i-3} k_1 + ... + 2 k_{i-3} + k_{i-2}

The size of each block F_{γ-i,α} is determined by the bits that are less significant than those
of the label α. We can see that there are γ-i+1 such bits, hence the size of F_{γ-i,α} is

2^{γ-i+1} = 2 · n/2^i = n/2^{i-1}

(for i = 1, F_{γ-1} consists of a single block of order n).
For a fixed α = 2^{i-2} k_0 + 2^{i-3} k_1 + ... + 2 k_{i-3} + k_{i-2}, the two possible values of
k_{i-1} split the block in two. For k_{i-1} = 0, the weight multiplying the second term of the
butterfly is

w^{2^{γ-i}(0 + 2^{i-2} k_{i-2} + 2^{i-3} k_{i-3} + ... + k_0)}

Now the submatrices I and w^r I, where I is the identity matrix of order n/2^i, are evident,
where r is given by

r = 2^{γ-i}(2^{i-2} k_{i-2} + 2^{i-3} k_{i-3} + ... + 2 k_1 + k_0)

A similar analysis for k_{i-1} = 1 shows the two submatrices in the second half of F_{γ-i,α}
to be I and -w^r I.
It remains to show that the γ-bit binary expansion of r is the reversal of that of 2α.
Note that

2α = 2^{i-1} k_0 + 2^{i-2} k_1 + ... + 2² k_{i-3} + 2 k_{i-2} + 0 · 2^0

and reversing its γ-bit binary expansion gives

2^{γ-2} k_{i-2} + 2^{γ-3} k_{i-3} + ... + 2^{γ-i} k_0 = 2^{γ-i}(2^{i-2} k_{i-2} + ... + 2 k_1 + k_0)

which is indeed r.
49
Note that because of the way the indices are ordered, at the end we obtain the components of Y in bit-reversed index order: y[k_{γ-1} k_{γ-2} ... k_1 k_0] = X_0[k_0 k_1 ... k_{γ-2} k_{γ-1}]. Hence

    F = R_{2^γ} F_0 F_1 F_2 ... F_{γ-2} F_{γ-1},

where R_{2^γ} is the permutation matrix corresponding to the bit-reversal permutation of the numbers 0, 1, ..., n − 1 = 2^γ − 1.
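To make the claim concrete, the following NumPy sketch builds each stage matrix F_{γ-i} from the block description above, forms the product R_{2^γ} F_0 F_1 ... F_{γ-1}, and checks it against the DFT matrix for n = 8. The function names (stage_matrix, bit_reverse) and the sign convention W = e^{-2πi/n} are ours, not the report's.

    import numpy as np

    def bit_reverse(x, gamma):
        # Reverse the gamma-bit binary expansion of x.
        r = 0
        for _ in range(gamma):
            r = (r << 1) | (x & 1)
            x >>= 1
        return r

    def stage_matrix(n, gamma, i, W):
        # F_{gamma-i}: block diagonal with 2**(i-1) blocks of order n // 2**(i-1);
        # each block is [[I, W**r * I], [I, -W**r * I]] with I of order n // 2**i
        # and r the gamma-bit reversal of 2*alpha.
        m = n // 2**i
        F = np.zeros((n, n), dtype=complex)
        for alpha in range(2**(i - 1)):
            r = bit_reverse(2 * alpha, gamma)
            I = np.eye(m)
            block = np.block([[I, (W**r) * I],
                              [I, -(W**r) * I]])
            top = alpha * 2 * m
            F[top:top + 2*m, top:top + 2*m] = block
        return F

    n, gamma = 8, 3
    W = np.exp(-2j * np.pi / n)          # sign convention assumed

    R = np.zeros((n, n))                 # bit-reversal (unscrambling) permutation R_{2^gamma}
    for k in range(n):
        R[k, bit_reverse(k, gamma)] = 1

    product = R.astype(complex)          # F = R_{2^gamma} F_0 F_1 ... F_{gamma-1}
    for s in range(gamma):
        product = product @ stage_matrix(n, gamma, gamma - s, W)

    dft = np.array([[W**(k * j) for j in range(n)] for k in range(n)])
    print(np.allclose(product, dft))     # expected: True

Increasing n to any power of 2, with gamma = log2(n), exercises the same construction.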
The decomposition of the DFT matrix into a product of sparse matrices, as shown in section 1, provides the essential tool for mapping the computation of the DFT onto hypercube systems, including Intel's iPSC, Ametek's S/14, Ncube's NCUBE systems and the Connection Machines from Thinking Machines. The hypercube has a simple recursive definition:
(a) a 0-cube consists of a single node;
(b) for d ≥ 1, a d-cube, with 2^d nodes, is obtained from two copies of a (d−1)-cube by joining each pair of corresponding nodes with a communication link.
Note the very natural labeling of the nodes of the d-cube: simply precede the labels of the nodes from the first copy of the (d−1)-cube with a 0, and those from the second copy with a 1.
[Figure: node labelings of the 0-cube, 1-cube, 2-cube, 3-cube and 4-cube.]
The hypercube topology has several attractive properties, among them:
(b) the layout is totally symmetric (by the way, the hypercube appears among proposed interconnection networks for many parallel machines);
(d) the topology exhibits a reasonable behavior in the presence of faulty processors or communication links;
(e) there are several paths connecting one node to another; in fact the number of disjoint paths of minimum length from node A to node B is the Hamming distance between the labels of A and B (a small sketch of this labeling arithmetic follows the list);
(f) many algorithms for diverse fields such as the physical sciences, signal processing, numerical analysis, operations research and graphics have been successfully designed for the hypercube.
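A minimal sketch of the label arithmetic behind property (e), using our own helper names: the neighbors of a node are the labels that differ from it in exactly one bit, and the routing distance between two nodes is the Hamming distance between their labels.

    def neighbors(label, d):
        # Nodes adjacent to `label` in a d-cube: flip each of the d bits in turn.
        return [label ^ (1 << b) for b in range(d)]

    def hamming(a, b):
        # Hamming distance between two labels = length of a shortest route in the cube.
        return bin(a ^ b).count("1")

    d = 3
    print([format(x, "03b") for x in neighbors(0b010, d)])   # ['011', '000', '110']
    print(hamming(0b000, 0b111))                             # 3 hops; 3 disjoint shortest paths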
A decomposition of the DFT matrix into a product of sparse matrices is visualized, in the usual manner, by means of a signal flow graph. For the length n = 2^3, the decomposition is shown below:
[Signal flow graph for n = 8: four columns of nodes labeled X_3, X_2, X_1, X_0, rows indexed 000 through 111, with twiddle factors W^0, W^1, W^2, W^3 on the branches.]
where the unscrambling of the index bits is not shown. Each column of nodes in the graph corresponds to an intermediate vector X_{γ-i} of section 1.
Suppose now that a 2-cube is available. Looking at the signal flow graph, we can allocate the first 2 components (indexed 000 and 001) to processor 00, the next two components (indexed 010 and 011) to processor 01, etc. In general, the two components indexed ab0 and ab1 are allocated to the processor labeled ab:
[Signal flow graph for n = 8 partitioned among the four processors of the 2-cube, with the twiddle factors W^0 through W^3 as before.]
We see that the computation requires 3 steps (if we ignore the bit reversing); during the first 2 steps, interprocessor communication is required; during the last step, only local data are needed.
4. The general case
The above example generalizes to the case of a length n = 2^γ transform, for which a k-cube is available. Each processor is assigned a k-bit binary label. The result of section 1 allows us to allocate the components of X and of any intermediate vector X_{γ-i} as follows. Each processor holds 2^γ / 2^k = 2^{γ-k} components at any stage; if the processor is labeled

    p = p_{k-1} p_{k-2} ... p_1 p_0,

then it works on the components whose indices are
    p_{k-1} p_{k-2} ... p_1 p_0 00 ... 00
    p_{k-1} p_{k-2} ... p_1 p_0 00 ... 01
    p_{k-1} p_{k-2} ... p_1 p_0 00 ... 10
    ...
    p_{k-1} p_{k-2} ... p_1 p_0 11 ... 11,

i.e. the 2^{γ-k} indices whose k most significant bits spell out the processor label p.
From the discussion in section 1 it should be clear that the computation can be decomposed into γ stages. The first k stages require interprocessor communication and the last γ − k stages use local data only. During stage i, where 1 ≤ i ≤ k, each processor labeled p = p_{k-1} p_{k-2} ... p_1 p_0 communicates with, and only with, the processor whose label differs from p only in bit p_{k-i}, so every exchange travels over a single hypercube link. A small sketch of this allocation and communication pattern follows.
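The sketch below (our own naming, not the report's) lists the components held by a processor and, for each stage, its communication partner: for stage i ≤ k the partner is p with bit k − i flipped, and the remaining stages are purely local.

    def components_of(p, gamma, k):
        # Indices (as gamma-bit integers) held by processor p: the k most
        # significant bits spell out p, the remaining gamma - k bits vary.
        low = gamma - k
        return [(p << low) | t for t in range(2 ** low)]

    def partner(p, i, k):
        # Stage i (1 <= i <= gamma): for i <= k the processor exchanges data
        # with the node differing from p in bit k - i; afterwards it is local.
        return p ^ (1 << (k - i)) if i <= k else None

    gamma, k, p = 4, 2, 0b01     # n = 16 points on a 2-cube, processor 01
    print([format(c, "04b") for c in components_of(p, gamma, k)])  # 0100 ... 0111
    print([partner(p, i, k) for i in range(1, gamma + 1)])         # [3, 0, None, None]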
5. Decomposing with the Chinese Remainder Theorem
The scheme presented in the previous sections works for a length that is a power of 2, and should generalize easily to any power of a prime integer. The next most obvious question is how to decompose the computation if the length is not a power of a prime. In this case, we rely on a factorization of the length into pairwise coprime factors and the ring isomorphism provided by the so-called Chinese Remainder Theorem (CRT).
The CRT states that if the integers n_1, n_2, ..., n_k are pairwise coprime, then the rings

    Z_n  and  Z_{n_1} × Z_{n_2} × ... × Z_{n_k}

are isomorphic, where n is the product of the n_i; the arithmetic in the ring Z_n is ordinary arithmetic modulo n, while in the product ring it is componentwise, modulo n_i in the i-th component. The isomorphism

    Z_n → Z_{n_1} × Z_{n_2} × ... × Z_{n_k}

is given by a ↦ (a mod n_1, a mod n_2, ..., a mod n_k), and its inverse by

    (a_1, a_2, ..., a_k) ↦ x = a_1 u_1 ū_1 + a_2 u_2 ū_2 + ... + a_k u_k ū_k (mod n),

where u_i = n/n_i and ū_i is the inverse of u_i modulo n_i (which must exist since u_i and n_i are coprime).
The essence of the isomorphism is that if n = n_1 n_2 ... n_k, where the n_i are pairwise coprime, then an integer between 0 and n−1 may be thought of as a k-tuple of integers, the first between 0 and n_1−1, the second between 0 and n_2−1, ..., the last between 0 and n_k−1.
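A minimal sketch of the two directions of the isomorphism, using the formula above; the names crt_split and crt_combine are ours, and Python's pow(u, -1, m) supplies the modular inverse.

    from math import prod

    def crt_split(a, moduli):
        # Z_n -> Z_{n1} x ... x Z_{nk}: reduce a modulo each n_i.
        return tuple(a % ni for ni in moduli)

    def crt_combine(residues, moduli):
        # Inverse map: x = sum_i a_i * u_i * ubar_i (mod n), where u_i = n / n_i
        # and ubar_i is the inverse of u_i modulo n_i (it exists: gcd(u_i, n_i) = 1).
        n = prod(moduli)
        x = 0
        for ai, ni in zip(residues, moduli):
            ui = n // ni
            x += ai * ui * pow(ui, -1, ni)
        return x % n

    moduli = (3, 5, 7)                                  # pairwise coprime, n = 105
    print(crt_split(52, moduli))                        # (1, 2, 3)
    print(crt_combine(crt_split(52, moduli), moduli))   # 52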
Now the relationship

    y[a] = Σ_{b=0}^{n-1} x[b] W^{ab}

may be written

    y(a_1, a_2, ..., a_k) = Σ_{b_1=0}^{n_1-1} Σ_{b_2=0}^{n_2-1} ... Σ_{b_k=0}^{n_k-1} z(b_1, b_2, ..., b_k) W^{(a_1, a_2, ..., a_k)(b_1, b_2, ..., b_k)},

where z(b_1, b_2, ..., b_k) denotes x[b] for b ↔ (b_1, b_2, ..., b_k).
The product

    (a_1, a_2, ..., a_k)(b_1, b_2, ..., b_k)

is computed componentwise in

    Z_{n_1} × Z_{n_2} × ... × Z_{n_k},

i.e. it equals (a_1 b_1 mod n_1, ..., a_k b_k mod n_k); hence if we let

    W_1 = W^{(1, 0, ..., 0)},  W_2 = W^{(0, 1, 0, ..., 0)},  ...,  W_k = W^{(0, ..., 0, 1)},

then W^{(a_1, ..., a_k)(b_1, ..., b_k)} = W_1^{a_1 b_1} W_2^{a_2 b_2} ... W_k^{a_k b_k}, and the transform can be computed in stages:
STAGE 1:

    z(b_1, ..., b_{k-1}, a_k) = Σ_{b_k=0}^{n_k-1} z(b_1, ..., b_{k-1}, b_k) W_k^{a_k b_k}

STAGE 2:

    z(b_1, ..., b_{k-2}, a_{k-1}, a_k) = Σ_{b_{k-1}=0}^{n_{k-1}-1} z(b_1, ..., b_{k-1}, a_k) W_{k-1}^{a_{k-1} b_{k-1}}

STAGE i:

    z(b_1, ..., b_{k-i}, a_{k-i+1}, ..., a_k) = Σ_{b_{k-i+1}=0}^{n_{k-i+1}-1} z(b_1, ..., b_{k-i+1}, a_{k-i+2}, ..., a_k) W_{k-i+1}^{a_{k-i+1} b_{k-i+1}}
STAGE k:

    y(a_1, a_2, ..., a_k) = Σ_{b_1=0}^{n_1-1} z(b_1, a_2, ..., a_k) W_1^{a_1 b_1}

Stage i requires the computation of n/n_{k-i+1} DFTs, each of length n_{k-i+1}. These DFTs may be computed in parallel, once stage i−1 has been completely finished. Note that if M_i (resp. A_i) is the number of multiplications (resp. additions) required for the n_i-transform, then in total the n-transform requires n · Σ_{i=1}^{k} M_i/n_i multiplications and n · Σ_{i=1}^{k} A_i/n_i additions.
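The staged computation can be sketched for two coprime factors. The code below (our variable names; numpy.fft is used only to check the result) maps a length-n sequence to an n1 × n2 array with the CRT index map, performs the two stages as small DFTs built from the "rotated" roots W_1 and W_2 defined above, and reads the result back out.

    import numpy as np

    def crt_dft(x, n1, n2):
        # DFT of length n = n1 * n2 (n1, n2 coprime) in two stages,
        # using the CRT index map a <-> (a mod n1, a mod n2).
        n = n1 * n2
        W = np.exp(-2j * np.pi / n)                 # sign convention assumed
        u1, u2 = n // n1, n // n2
        W1 = W ** (u1 * pow(u1, -1, n1))            # W^(1,0): a primitive n1-th root
        W2 = W ** (u2 * pow(u2, -1, n2))            # W^(0,1): a primitive n2-th root

        z = np.empty((n1, n2), dtype=complex)       # z(b1, b2) = x[b], b <-> (b1, b2)
        for b in range(n):
            z[b % n1, b % n2] = x[b]

        F2 = np.array([[W2 ** (a2 * b2) for b2 in range(n2)] for a2 in range(n2)])
        F1 = np.array([[W1 ** (a1 * b1) for b1 in range(n1)] for a1 in range(n1)])
        stage1 = z @ F2.T                           # stage 1: n/n2 DFTs of length n2
        Y = F1 @ stage1                             # stage 2: n/n1 DFTs of length n1

        y = np.empty(n, dtype=complex)              # read y[a] back out via the same map
        for a in range(n):
            y[a] = Y[a % n1, a % n2]
        return y

    x = np.random.rand(15)                          # n = 3 * 5
    print(np.allclose(crt_dft(x, 3, 5), np.fft.fft(x)))   # expected: True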
6. Future work
We need to find decompositions of the DFT matrix that allow better mappings of the computation onto the hypercube. The scheme discussed in sections 1 through 4, for example, leaves the load between the processors unbalanced; some of the processors carry more of the work than others, and this imbalance needs to be formalized. It also calls for the (parallel) study of transforms of short lengths.
7. References
[1] Fox, G. et al., Solving Problems on Concurrent Processors, Vol. 1, Prentice-Hall, 1988.
[2] Kallstrom, M. and S. S. Thakkar, Programming Three Parallel Computers, IEEE Software, 1988.
[3] Oppenheim, A. V. and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, 1989.
Conclusions
As stated at the outset there are several avenues that we are interested in pursuing
further. Of these, the use of algebraic transforms and associated coding techniques
in the design of neural-net based classifiers stands out as the area where we should
focus research efforts next. We want to come up with classifiers capable of real-time response and low error rates. As part of this effort we need to identify, using coding-theoretic knowledge, transforms suited to operating environments with different design requirements: a large number of classes, a large block length, or high fault tolerance. After establishing that the proposed family of architectures is worth pursuing, we will next study how to make them more
flexible by considering their concatenation with learning architectures to allow for
the generation of new classes, this being achieved without prohibitively increasing
the overall response time of the composite adaptive network. The last phase of the
proposed effort involves the computation of the error rates for the most promising
of the resulting classifiers. Complete simulations need to be carried out on realistic
classification problems.
Concurrent with this report we are submitting a proposal for a follow-up project
based on these considerations.
MULTILAYER TRANSFORM-BASED NEURAL NETWORK CLASSIFIERS
1. SUMMARY
This proposal seeks support for the investigation of neural network classifiers based on discrete transforms whose kernels can be identified with linear and nonlinear algebraic binary codes. More precisely, the study will concentrate on the feasibility of designing and implementing nonclassical classifiers that are multilayer neural nets, where one or more layers transform points in the pattern space into points in some appropriate "frequency" domain, and where other layers classify the latter points through attributes of the transform used. Concatenation of the proposed networks with learning architectures will also be experimented with.
2. PROJECT OBJECTIVES
The main objective of the project is to produce classifiers with low error rates and real-
time response. These two attributes are viewed by researchers as the most fundamental
characteristics of classifiers with regard to the application of pattern classification to
real-world problems such as speech processing, radar signal processing, vision and
robotics. This project will explore the possibility of achieving a low error rate by viewing
patterns in the "frequency" domain.
Mappings from pattern space to frequency domain that are continuous must be devised, where continuous means that patterns that are close under the pattern-space metric will have their images close under the Hamming distance. The frequency domain will then appear as a code space, and pattern classification might be thought of as a decoding problem.
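As an illustration of the kind of continuity intended here (our example, not a mapping proposed in this document): a unary, or "thermometer", encoding of a quantized feature maps values that are close on the real line to binary words that are close in Hamming distance.

    def thermometer(value, levels):
        # Quantize value in [0, 1) to `levels` levels and encode it in unary,
        # e.g. levels = 8, value = 0.4 -> 3 -> 11100000.
        q = min(int(value * levels), levels - 1)
        return [1] * q + [0] * (levels - q)

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    a, b, c = thermometer(0.40, 8), thermometer(0.45, 8), thermometer(0.90, 8)
    print(hamming(a, b), hamming(a, c))   # nearby values -> 0, distant values -> 4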
In the first phase of the project, the codewords will be "hard-wired" among the
interconnections of a neural network. The lower layers of this net will implement the
foregoing continuous mapping. They transform pattern vectors into noisy versions of
the codewords corresponding to the class labels. The upper layers of the net will use
coding-theoretic properties in code space to correct errors in the corrupted codewords,
thus identifying the class labels. Because of the hardwiring, the training time and the
operational response time should be relatively short, the latter corresponding only to the propagation delay through the layers of the neural net. Since it is not expected that a single transform (i.e., algebraic code) will accommodate all classification situations, the identification of suitable transforms and their matching with different operating environments (large number of classes, high fault tolerance, large pattern vector length, etc.) will also be studied. The large body of knowledge about algebraic codes and decoding procedures will be put to use. It should be noted that many transforms used in signal processing do arise from appropriate algebraic codes; for example, the Walsh-Hadamard transform can be identified with a first-order Reed-Muller code.
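A toy sketch of the decoding role assigned to the upper layers (our illustration; the actual codes and transforms are to be selected during the project): each class label is identified with a codeword, the lower layers produce a possibly corrupted codeword, and the upper layer returns the label of the nearest codeword in Hamming distance.

    # Hypothetical 3-class example with hand-picked 6-bit codewords.
    codebook = {
        "class_A": (0, 0, 0, 0, 0, 0),
        "class_B": (1, 1, 1, 0, 0, 0),
        "class_C": (1, 1, 1, 1, 1, 1),
    }

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    def decode(noisy_word):
        # "Upper layer": error correction by minimum Hamming distance.
        return min(codebook, key=lambda label: hamming(codebook[label], noisy_word))

    # A one-bit corruption of class_B's codeword is still decoded correctly.
    print(decode((1, 0, 1, 0, 0, 0)))   # class_B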
The next phase of the project will investigate the possibility of introducing more
flexibility into the above architectures. Clearly for this architecture the class labels
must be known beforehand, the choice of codewords that identify the labels must be
built into the net, and learning from examples is nonexistent. By concatenating the proposed networks with learning architectures that can form new classes, this rigidity can be removed without giving up the low error rates, but of course the training time and operating response time might significantly increase.
The last phase of the project will address the computation of the error rate for the
proposed classifier architecture in order to evaluate its place among proposed and
implemented classifiers. A difficulty that arises here is that the continuous mapping
used for going from pattern space to code space may introduce additional errors
(noise).
3. REFERENCES
F. J. MacWilliams & N. J. A. Sloane, "The Theory of Error-Correcting Codes", North-Holland, (1977).
S. Thomas Alexander, "Adaptive Signal Processing", Springer-Verlag, (1986).
David G. Stork, "Self-Organization, Pattern Recognition, and Adaptive Resonance
Networks", J. of Neural Network Computing, (Summer 1989).
R. P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE ASSP
Magazine, (April 1987).