Swartzlander 1984
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-19, NO. 5, OCTOBER 1984
Abstract: This paper describes the development of a semicustom delay commutator circuit to support the implementation of high speed fast Fourier transform processors based on the McClellan and Purdy radix 4 pipeline FFT algorithm. The delay commutator is a 108000 transistor circuit comprising 12288 shift register stages and approximately 2000 gates of random logic realized with 2.5 micrometer design rule CMOS standard cell technology. It operates at a 10 MHz clock rate which processes data at a 40 MHz rate. The delay commutator is suitable for implementing processors that compute transforms of 16, 64, 256, 1024, and 4096 (complex) points. It is implemented as a 4 bit wide data slice to facilitate concatenation to accommodate common data word sizes and to use a standard 48 pin dual-in-line package.

Manuscript received April 11, 1984; revised June 14, 1984.
E. E. Swartzlander, Jr. and W. K. W. Young are with TRW Defense Systems Group, Redondo Beach, CA 90278.
S. J. Joseph is with AT&T Bell Laboratories, Allentown, PA 18103.

I. INTRODUCTION

Although the Cooley-Tukey FFT algorithm [1] developed nearly two decades ago has made it possible to apply digital signal analysis techniques to many applications, many others (e.g., radar and sonar beam forming, adaptive filtering, communications spectrum analysis, etc.) require both flexibility and speed that exceed the present state of the art. Currently, there are three approaches for signal processing: software implemented on general purpose computers, software implemented on a general purpose computer augmented with a Programmable Signal Processor (PSP), and custom hardware development.

Software only and Software-PSP implementations are adequate when the spectral bandwidth is under 10 MHz. Custom processors achieve analysis bandwidths of 10-50 MHz, but most are optimized for a specific application and would require extensive (and expensive) redesign to modify them to suit other applications. Thus general purpose computers with or without PSP augmentation are too slow while custom processors lack the required flexibility.

Current signal processing systems require many diverse functions: transform computation, time and frequency domain vector processing, and general purpose computing. We are developing a growing family of building block modules to facilitate the development and implementation of such systems on a semicustom basis. The result is the ability to quickly develop high performance signal processing systems for a wide variety of algorithms. The use of predesigned and precharacterized modules reduces cost, development time, and most importantly, risk. The initial set of modules was described in 1983 [2]. The modules defined include a data acquisition module, building block elements that are replicated to realize pipeline FFT and inverse FFT modules, a frequency domain filter module, a power spectral density computational module, and an output interface module.

The modules all have separate data and control interfaces. The separation of the data and control is analogous to the Harvard mainframe computer architecture which uses separate data and instruction memories to eliminate
[Figure: block diagram from DATA INPUT to TRANSFORM output; legend: CE = computational element, DC(X) = delay commutator.]
the "von Neumann bottleneck." In signal processing the separation of data and control allows the simple data interfaces to operate at high speed while the more flexible and complex control interfaces operate at a slower rate. All data interfaces satisfy a common interface protocol so that modules can be connected together to form architectures that match the data flow of each specific system.

the transform of long sequences (e.g., 1K, 4K, or 16K points) often requires complex logic, thereby mitigating the advantages of the VLSI realizations. They also lack the flexibility to efficiently transform sequences of varying lengths as may be required for many applications.
[Fig. 2: data routing chart for a 64 point transform, drawn as four parallel streams of 16 points each. Panels, top to bottom: INPUT DATA (streams 0-15, 16-31, 32-47, 48-63); INPUT DELAY (paths 2, 3, 4 delayed by 4, 8, 12 stages); REORDERING THROUGH DELAY COMMUTATOR, X=4; OUTPUT DELAY (paths 1, 2, 3 delayed by 12, 8, 4 stages); RADIX 4 BUTTERFLY DATA (first stream 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51); INPUT DELAY (paths 2, 3, 4 delayed by 1, 2, 3 stages); REORDERING THROUGH DELAY COMMUTATOR, X=1; RADIX 4 BUTTERFLY DATA (streams 0 4 8 ... 60, 1 5 9 ... 61, 2 6 10 ... 62, 3 7 11 ... 63).]
delay commutator reorders the data between computational stages as required for the FFT algorithm.

The interstage data reordering required at stage $i$ in the implementation of a $4^{n}$ point transform is a base 4 digit reversal [10] of the data elements in a $4 \times 4^{n}$ matrix. When data enter the delay commutator they pass through four parallel delay lines which skew the data. The first data path receives no delay, the second receives a delay of $4^{n-i-1}$, the third receives a delay of $2 \times 4^{n-i-1}$, and the fourth receives a delay of $3 \times 4^{n-i-1}$. The data then pass through the commutator switch, where they are switched to selected data paths. Lastly, the data are deskewed through a second set of delay lines.

The routing of data that occurs in processing a 64 point transform with the radix 4 pipeline FFT algorithm is graphically charted in Fig. 2. Input data numbered 0-63 are shown on four parallel streams at the top of the figure. The action of the first delay commutator (set for X=4) is shown. It transforms the four streams of data separated by 16 points into four streams where the data are separated by
SWARTZLANDER et al.: RADIX 4 DELAY COMMUTATOR 705
[Fig. 3: delay commutator block diagram. Labels recovered: DATA INPUT, DATA OUTPUT, OUTPUT ENABLE, 768 WORD SHIFT REGISTER (two), 4:1 MUX, 5:1 MUX, 8:1 MUX, DELAY LENGTH, CTR RESET, CTR PRESET, PRESET ENABLE, COUNTER WITH TAPS AT EVEN STAGES.]
four points. These data are operated upon by a radix 4 butterfly which does not change the data order. A second delay commutator (set for X=1) reorders the data to produce streams of adjacent data. This process is derived and explained in greater detail in [10].

IV. THE DELAY COMMUTATOR CIRCUIT

Careful examination of our initial (off the shelf technology) FFT module design revealed that much of the complexity was due to the delay commutator element. Initial complexity estimates are 80 commercial integrated circuits for the computational element and 180 circuits for the delay commutator. The disparity in complexity arises because of the difficulty of realizing shift registers that can be set to a variety of lengths as required for the various delays. The most efficient approach involved simulating a delay line by using a RAM with write and read addresses displaced by a constant (i.e., the length of the simulated delay line). In view of the high complexity of the delay commutator, development of a semicustom implementation was undertaken. The resulting design of the delay commutator is a 4 bit wide slice that uses programmable length shift registers and a 4 X 4 switch as shown in Fig. 3. Data enter through shift registers with taps and multiplexers to set the delay at 1, 4, 16, 64, or 256 (= X) in the uppermost input register and multiples of 2X and 3X in the middle and lower registers, respectively. Four 4:1 multiplexers implement the commutator function under the control of the programmable rate counter. The final 2 bit counter/decoder that controls the multiplexer settings can be reset and held in state 0 to disable the commutator switch function. In this mode the chip provides fixed length registers with delays of 256, 2 X 768, and 1280 which are used to expand the delay commutator for 16384 point transforms. Data from the 4:1 multiplexers are output through programmable length shift registers that are similar to the input registers.

Gate array, standard cell, and custom technologies were considered for implementation of the delay commutator. An optimum balance between high circuit performance and low implementation cost was achieved using the AT&T Bell Laboratories' polycell (standard cell) CMOS technology. This technology was selected because it is well suited to the development of VLSI with high density shift registers and random logic. It is a twin tub 2.5 micrometer CMOS technology with chan-stops for device isolation and an epitaxial layer for latch-up protection [11].

The delay commutator circuit contains 12288 shift register stages and about 2000 gates of random logic, for a total complexity of 108000 transistors. At a clock rate of 10 MHz, the power dissipation is under 1/2 W. The chip size is 340 X 376 mil. The very high speed integrated circuit (VHSIC) program uses functional throughput rate (FTR) as a measure of circuit performance [12]. FTR is defined as the product of the number of gates times the chip clock rate divided by the area. By this measure (assuming a conversion rate of 3 transistors to 1 gate), the delay commutator FTR is 4.3 X 10^11 gate·Hz/cm^2. Although not satisfying the important VHSIC environmental requirements, this is impressive performance for a semicustom circuit. The chip is shown on Fig. 4. Each of the four bit slices is constructed with input registers in a column,
706 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-19, NO. 5, OCTOBER 1984
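The chip figures quoted in Section IV can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: the variable names are illustrative, the mil-to-centimeter conversion (1 mil = 0.00254 cm) is standard, and the register budget assumes that the output registers match the input registers at their maximum setting (X = 256), which the text describes only as "similar."

```python
# Sanity check of the Section IV figures (names are illustrative, not from the paper).
TRANSISTORS = 108_000
TRANSISTORS_PER_GATE = 3      # conversion rate stated in the text
CLOCK_HZ = 10e6               # 10 MHz clock rate
MIL_TO_CM = 2.54e-3           # 1 mil = 0.001 inch = 0.00254 cm

# Functional throughput rate: gates * clock rate / chip area.
area_cm2 = (340 * MIL_TO_CM) * (376 * MIL_TO_CM)
gates = TRANSISTORS / TRANSISTORS_PER_GATE
ftr = gates * CLOCK_HZ / area_cm2
print(f"area = {area_cm2:.3f} cm^2, FTR = {ftr:.2e} gate*Hz/cm^2")

# Shift register budget per 1 bit slice.
# ASSUMPTION: input and output register banks each hold X + 2X + 3X stages at X = 256.
X_MAX = 256
stages_per_bank = X_MAX + 2 * X_MAX + 3 * X_MAX   # 1536
stages_per_bit = 2 * stages_per_bank              # input bank + output bank
total_stages = 4 * stages_per_bit                 # 4 bit wide slice

# Fixed length mode quoted in the text: delays of 256, 2 x 768, and 1280.
fixed_mode_stages = 256 + 2 * 768 + 1280
print(total_stages, fixed_mode_stages)
```

Under these assumptions the stage count comes out at the 12288 stages quoted in the text, the fixed length mode uses all 3072 stages available per bit, and the FTR lands between 4.3 X 10^11 and 4.4 X 10^11 gate·Hz/cm^2, in line with the quoted value.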
[Figure annotations recovered: VDD, DATA IN; scale markers 5 V and 20 ns; upper trace: 10 MHz clock.]
Fig. 7. Clock and data output waveforms (10 MHz clock rate).
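The skew, commutate, and deskew sequence described earlier for the 64 point transform can be checked with a short behavioral simulation. This is a minimal sketch, not the chip's logic: the function name `delay_commutator` is hypothetical, `None` stands for cycles in which a path carries no sample of the current block, and the switch rotation rule (output path j selects input path (t // x - j) mod 4 at cycle t) is an assumption chosen to be consistent with the delays listed in Fig. 2.

```python
def delay_commutator(streams, x):
    """Reorder four parallel streams: skew, commutate, deskew."""
    length = len(streams[0])
    total = length + 3 * x  # cycles until the last sample leaves the skew lines

    # 1) Input skew: path k (k = 0..3) is delayed by k*x cycles.
    #    None marks cycles where a path carries no sample of this block.
    skewed = [[None] * (k * x) + list(s) + [None] * ((3 - k) * x)
              for k, s in enumerate(streams)]

    # 2) Commutator: the switch state advances every x cycles.
    #    ASSUMPTION: at cycle t, output path j selects input path (t // x - j) mod 4.
    switched = [[skewed[(t // x - j) % 4][t] for t in range(total)]
                for j in range(4)]

    # 3) Output deskew: path j is delayed by (3 - j)*x cycles, so every path
    #    sees the same overall latency of 3*x cycles, which is sliced off here.
    return [([None] * ((3 - j) * x) + switched[j])[3 * x:3 * x + length]
            for j in range(4)]


# Input data 0-63 on four parallel streams of 16 points each.
inputs = [list(range(16 * k, 16 * k + 16)) for k in range(4)]
stage1 = delay_commutator(inputs, 4)   # first delay commutator, X = 4
stage2 = delay_commutator(stage1, 1)   # second delay commutator, X = 1
# The radix 4 butterfly between the two stages does not change the data order,
# so it is omitted from this index-only model.
for row in stage2:
    print(row)
```

With X=4 the first stream becomes 0 1 2 3 16 17 18 19 ..., so the four points entering a butterfly together are separated by four; with X=1 the streams become 0 4 8 ..., 1 5 9 ..., 2 6 10 ..., 3 7 11 ..., so the four points entering the next butterfly are adjacent. Both results agree with the description of Fig. 2.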
TABLE I
COMPLEXITY REDUCTION ACHIEVED WITH THE DELAY COMMUTATOR CIRCUIT

COMPUTATIONAL ELEMENT   6 CARDS AT 80 CKTS/CARD    480 CKTS    6 CARDS AT 91 CKTS/CARD   546 CKTS
TOTAL                   11 CARDS                  1375 CKTS    6 CARDS                   546 CKTS

COMPUTATIONAL ELEMENT   7 CARDS AT 80 CKTS/CARD    560 CKTS    7 CARDS AT 91 CKTS/CARD   637 CKTS
DELAY COMMUTATOR        6 CARDS AT 179 CKTS/CARD  1074 CKTS    1 EXTENDED DC, 1024        33 CKTS
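The surviving entries of Table I can be checked for internal consistency. The sketch below uses only the numbers listed in the table; the dictionary layout and names are illustrative, and every value marked "derived" is arithmetic on the listed rows rather than a figure stated in the source.

```python
# Surviving Table I entries: (cards, circuits per card, listed circuit total).
listed = {
    "CE, first block, before":  (6, 80, 480),
    "CE, first block, after":   (6, 91, 546),
    "CE, second block, before": (7, 80, 560),
    "CE, second block, after":  (7, 91, 637),
    "DC, second block, before": (6, 179, 1074),
}

# Each listed total should equal cards * circuits per card.
consistent = all(cards * per_card == total
                 for cards, per_card, total in listed.values())
print("row totals consistent:", consistent)

# First block: totals are listed directly (1375 circuits before, 546 after).
reduction_first = 1375 / 546
print(f"first block reduction: {reduction_first:.2f}x")

# Derived (not listed in the table): the first block total minus its
# computational element row leaves 895 circuits, which equals 5 * 179.
residual_first = 1375 - 480
print(residual_first, residual_first == 5 * 179)

# Derived (not listed): second block sums from its two surviving rows.
before_second = 560 + 1074
after_second = 637 + 33
print(before_second, after_second)
```

The listed totals reproduce exactly, and the first block shows roughly a 2.5 to 1 reduction in circuit count (1375 to 546) along with a drop from 11 cards to 6.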
Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1982, pp. 1081-1083.
[6] H. J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms. New York: Springer-Verlag, 1982.
[7] J. H. McClellan and R. J. Purdy, "Applications of digital signal processing to radar," in Applications of Digital Signal Processing, A. V. Oppenheim, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1978, ch. 5.
[8] H. L. Groginsky and G. A. Works, "A pipeline fast Fourier transform," IEEE Trans. Comput., vol. C-19, pp. 1015-1019, 1970.
[9] J. A. Eldon and C. Robertson, "A floating point format for signal processing," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1982, pp. 717-720.
[10] L. R. Rabiner and B. Gold, Theory and Applications of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975, ch. 10.
[11] L. C. Parrillo et al., "Twin-tub CMOS II - An advanced VLSI technology," in Proc. Int. Electron Devices Meeting, 1982, pp. 706-709.
[12] L. W. Sumney, "VHSIC: A status report," IEEE Spectrum, vol. 19, pp. 34-39, Dec. 1982.
[13] E. E. Swartzlander, Jr., and G. Hallnor, "Fast transform processor implementation," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1984, pp. 25A.5.1-25A.5.4.
[14] G. W. Preston, "The very large scale integrated circuit," Amer. Scientist, vol. 71, pp. 466-472, 1983.

Earl E. Swartzlander, Jr. (S'64-M'72-SM'79) received the B.S.E.E. degree from Purdue University, Lafayette, IN, in 1967, the M.S.E.E. degree from the University of Colorado, Boulder, in 1969, and the Ph.D. degree from the University of Southern California, Los Angeles, in 1972. He obtained his doctorate in computer design with the support of a Howard Hughes Doctoral Fellowship.
He is currently the Manager of the Advanced Development Office in the Systems Engineering Operations of the TRW Defense Systems Group. This involves the conceptual definition and development of advanced signal processing systems. His current activity focuses on the related issues of algorithms, architecture, and implementation with off-the-shelf as well as custom VLSI. He has directed the development of a variety of advanced systems, and has developed the architectural and functional design of VLSI components which are in varying stages of design, development, and production. He has published over 50 papers in the fields of computer architecture, VLSI implementation, and computer arithmetic, and is the Editor of the books Computer Design Development (Rochelle Park, NJ: Hayden Book Co., 1976) and Computer Arithmetic (Stroudsburg, PA: Dowden, Hutchinson & Ross, 1980).
Dr. Swartzlander is an Editor of the IEEE TRANSACTIONS ON COMPUTERS. He is a member of the Association for Computing Machinery and belongs to Eta Kappa Nu, Sigma Tau, and Omicron Delta Kappa honorary fraternities, and is a registered Professional Engineer in Alabama, California, and Colorado.

Wendell K. W. Young (M'83) received the B.S. degree in information computer sciences from the University of Hawaii, Honolulu, in 1978.
In 1979 he joined TRW, Redondo Beach, CA, as a Member of the Technical Staff conducting analysis work on the post-flight navigation data system for the Space Shuttle program. In 1980, he joined Magnavox Government Systems where he was engaged in the development of integrated navigation systems. He is currently a Systems Engineer in the TRW Defense Systems Group involved in the development of advanced signal processing circuits and systems.

Saul J. Joseph (S'77-M'80) received the B.S.E.E. degree from Rutgers University, New Brunswick, NJ, in 1978 and the M.S.E.E. degree from Lehigh University, Bethlehem, PA, in 1979.
In 1978 he joined AT&T Bell Laboratories. He is currently a Member of the Technical Staff in the VLSI Design Laboratory in Allentown, PA. He is responsible for the design of semicustom VLSI CMOS circuits using the Polycell family of standard cells.
Mr. Joseph is a member of Tau Beta Pi, Eta Kappa Nu, and the National Society of Professional Engineers.