
Li 2009

The document presents a novel parallel pipeline FFT processor tailored for MB-OFDM UWB systems. It employs a Radix 2² algorithm and parallel pipeline architecture to provide a small-area and low-power solution that meets ECMA requirements. Synthesis results show it achieves the required 264 MHz clock frequency with 39,000 gates and an area of 181,140 μm² in ASIC 90 nm technology.

A Radix 2² Based Parallel Pipeline FFT Processor for MB-OFDM UWB System


Nuo Li and N.P. van der Meijs
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)
Delft University of Technology, Delft, Netherlands
Email: [email protected]

Abstract—This paper presents a novel parallel pipeline FFT processor especially tailored for the Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM) Ultra Wideband (UWB) system defined by ECMA International. The proposed Radix 2² Parallel Pipeline processor, which combines a two-parallel-data-path Radix 2² algorithm with a single-path delay feedback (SDF) pipeline architecture, is a small-area and low-power solution for MB-OFDM UWB systems. Synthesis results are presented for both a Xilinx Virtex4 FPGA and a 90 nm ASIC technology with a 1 V supply voltage. The results show that, owing to the revised algorithm and novel architecture, a clock frequency of 264 MHz suffices to meet the ECMA requirement. The design requires 39000 gates, excluding the testing block, and the corresponding area is 181140 μm².

I. INTRODUCTION

Ultra-Wideband (UWB) technology brings the convenience and mobility of wireless communications to high-speed interconnects in devices throughout the digital home and office [1]. The Multiband-OFDM standard is one solution for UWB technology. A proposal for a Multi-band OFDM UWB standard was published by the IEEE 802.15.3a study group [2]. In December 2007, the second edition of Standard ECMA-368 was released, which specifies the physical layer (PHY) and medium access control layer (MAC) of UWB technology based on Multiband-OFDM [3].

Several key issues must be solved to design a CMOS-based Multiband-OFDM UWB solution that meets the low-power requirement. One of them concerns the FFT (Fast Fourier Transform) block, which accounts for 25% of the design complexity of the total digital baseband transceiver [4]. Although many results have been published in this research area in the past few years [5], [6], [7], the area and power consumption of the FFT block still need to be improved, since the system targets wireless portable devices. Therefore, this paper focuses on improving area and power consumption under the ECMA-368 standard requirements. Section II describes the requirements for the FFT block and the algorithm on which the proposed design is based. Section III presents the proposed FFT solution at the algorithm, architecture, and implementation levels. Section IV shows the synthesis results for both FPGA and ASIC implementation, together with a comparison with other published implementations.

II. BACKGROUND

A. The Requirements of FFT for Multiband OFDM System

According to ECMA-368, the required sampling frequency is 528 MHz and the total number of subcarriers, which determines the FFT size, is 128. The time period available for the IFFT and FFT is 242.42 ns, which is the FFT size multiplied by the inverse of the sampling frequency ($T_{FFT} = 128 \cdot \frac{1}{f_s}$). There are 37 zero-padded suffix samples, which take 70.08 ns. So the total symbol interval is 312.5 ns ($T_{SYM} = T_{FFT} + T_{ZPS}$).

The word length is a critical choice in FFT processor design. It is determined directly by the trade-off between chip area and signal to quantization noise ratio (SQNR). Based on the analysis of [5] and [8], the word length is chosen to be 10 bits in this paper for simulation and comparison with their designs.

B. The Selection of FFT Algorithms

The traditional radix 2 FFT algorithms have a simple structure and a clear data flow, which makes them easy to implement and suitable for generic FFT implementation. Nevertheless, these algorithms need a large memory to store data at the inner stages, which costs considerable power and area. Nowadays, there are two trends in FFT implementation for OFDM systems: mixed radix algorithms, such as [7], and pipeline structure based algorithms, such as [9]. Based on extensive algorithm analysis and selection, the proposed design employs the Radix 2² algorithm developed by He and Torkelson [10], which integrates the twiddle factor decomposition every two stages. The Radix 2² algorithm has the same multiplicative complexity as the radix 4 algorithm, but retains the butterfly structure of the radix 2 algorithm, which makes it very suitable for ASIC implementation.

The detailed derivation of the algorithm can be found in [10]. Its application to an 8 point FFT, shown in Figure 1, is used here to explain the algorithm briefly. In this application the Radix 2² algorithm is used only once, for the first two stages, because an 8 point DFT can only be decomposed once by radix 4. For the last stage, the normal radix 2 DIF algorithm is used. With the Radix 2² algorithm, the complex multiplication by the twiddle factor in the first stage is replaced by a multiplication by (−j). Therefore, in a pipeline structure, one complex multiplier can be saved for an 8 point FFT.
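The decomposition described above can be illustrated with a small software model. The following is a minimal sketch, not the authors' hardware: it applies the He and Torkelson index split recursively, so that the first butterfly uses only additions, the second uses only a rotation by a power of (−j), and a full complex twiddle multiplication appears once per pair of stages. The names `radix22_fft`, `dft`, and `ROTATIONS` are hypothetical, and the sketch writes results in natural order rather than the bit-reversed order of a pipeline.

```python
import cmath

# Trivial rotations (-j)**m for m = 0..3, applied inside butterfly type 2.
ROTATIONS = (1, -1j, -1, 1j)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def radix22_fft(x):
    """Recursive radix-2^2 DIF FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:  # leftover plain radix-2 stage (e.g. the last stage for N = 8)
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        sign = -1 if k1 else 1
        for k2 in (0, 1):
            m = k1 + 2 * k2
            h = []
            for n3 in range(q):
                # Butterfly type 1: additions and subtractions only.
                a = x[n3] + sign * x[n3 + 2 * q]
                b = x[n3 + q] + sign * x[n3 + 3 * q]
                # Butterfly type 2: the only rotation is a power of -j.
                bf2 = a + ROTATIONS[m] * b
                # Full complex twiddle W_N^(n3*m), needed once per two stages.
                h.append(bf2 * cmath.exp(-2j * cmath.pi * n3 * m / n))
            sub = radix22_fft(h)
            for k3 in range(q):
                out[m + 4 * k3] = sub[k3]
    return out
```

For N = 8 the recursion performs one Radix 2² step followed by a plain radix 2 stage, and for N = 128 it performs three Radix 2² steps followed by one radix 2 stage, which matches the seven-stage structure discussed in the paper.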


  
      
Fig. 1. Radix 2² based parallel FFT algorithm data flow

III. THE PROPOSED PROCESSOR


The proposed processor is described at the algorithm, architecture, and implementation levels.

A. The Revision in the Algorithm Level

An analysis of the normal Radix 2² algorithm shows that the input data can be separated into odd and even parts, and that these parts are not mixed until the last stage. This is one of the key points of the proposed processor: it can be exploited in the architecture design to reduce both the working frequency and the number of registers.
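The effect on the working frequency can be checked with simple arithmetic. The sketch below uses the ECMA-368 figures quoted in Section II-A and assumes that each of the two paths accepts one sample per clock cycle; that assumption is introduced here for illustration, although it is consistent with the 264 MHz figure in the abstract. The function name `timing_budget` and its dictionary keys are hypothetical.

```python
FS_HZ = 528e6   # sampling frequency from Section II-A
N_FFT = 128     # FFT size
N_ZPS = 37      # zero-padded suffix samples
PATHS = 2       # even and odd data paths (assumed: one sample per path per clock)


def timing_budget(fs_hz=FS_HZ, n_fft=N_FFT, n_zps=N_ZPS, paths=PATHS):
    """Return the symbol timing and the per-path clock implied by the figures above."""
    t_fft = n_fft / fs_hz          # time available for one FFT/IFFT
    t_zps = n_zps / fs_hz          # duration of the zero-padded suffix
    return {
        "t_fft_ns": t_fft * 1e9,
        "t_zps_ns": t_zps * 1e9,
        "t_sym_ns": (t_fft + t_zps) * 1e9,
        "f_clk_mhz": fs_hz / paths / 1e6,   # each path sees every second sample
        "cycles_per_path": n_fft // paths,  # samples handled by one path per FFT
    }


budget = timing_budget()
```

Under these assumptions each path handles 64 samples in 64 cycles of a 264 MHz clock, which spans the same 242.42 ns window as 128 samples at 528 MHz.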
The eight point FFT data flow in Figure 1 is used again to illustrate the change. The dashed lines show the odd input data flow, while the solid lines show the even input data flow. In the first and second stages, the dashed and solid lines never cross, which means that the even and odd input data can be processed separately in these stages. Only in the last stage do the dashed and solid lines cross, which means that the even and odd data must be combined there.

Fig. 2. The 128 point parallel Radix 2² based algorithm data flow
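The observation that even and odd inputs stay apart until the last stage can be checked on the butterfly connectivity alone. The following minimal sketch assumes the usual in-place decimation-in-frequency flow graph, in which stage s pairs position n with position n + N/2^s. It tracks which input indices each position depends on and reports the first stage where an output depends on both parities. It models only the wiring, not the arithmetic or the twiddle factors, and the name `first_mixing_stage` is hypothetical.

```python
def first_mixing_stage(n_points):
    """Return the first DIF stage (1-based) that combines even- and odd-indexed inputs."""
    stages = n_points.bit_length() - 1      # log2(N) for a power of two
    deps = [{i} for i in range(n_points)]   # input indices feeding each position
    for stage in range(1, stages + 1):
        span = n_points >> stage            # distance between butterfly partners
        new_deps = [set() for _ in range(n_points)]
        for base in range(0, n_points, 2 * span):
            for i in range(base, base + span):
                merged = deps[i] | deps[i + span]
                new_deps[i] = merged
                new_deps[i + span] = set(merged)
        deps = new_deps
        # A position mixes parities if it depends on both an even and an odd input.
        if any(len({d % 2 for d in dep}) == 2 for dep in deps):
            return stage
    return None


stage_for_8 = first_mixing_stage(8)      # 3: only the last of three stages mixes
stage_for_128 = first_mixing_stage(128)  # 7: only the last of seven stages mixes
```

Because the partner distance N/2^s is even in every stage except the last, a butterfly always joins two positions of the same parity until the final stage, which is why two independent data paths suffice for the earlier stages.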
The 128 point parallel algorithm data flow, with the twiddle factor positions, is shown in Figure 2. The horizontal lines are not shown. The input data and twiddle factors are separated into even and odd parts, which are processed separately through the first six stages and combined only in the final, seventh stage. Note that the output data are produced in bit-reversed order.

B. Architecture Level

From the previous analysis, two-path parallelism in the first six stages suits the structure, because these six stages can process the even and odd input data separately, while the last (seventh) stage needs to mix the even and odd data. Nevertheless, this architecture imposes some extra requirements. First, a demultiplexer is required to separate the input data into the even and odd parts. On the other hand, the controller can be shared by the even and odd paths. Special care should be taken to generate the right control signals for the last stage, so that the even and odd parts are combined correctly.

Figure 3 shows the proposed parallel pipeline architecture. It has seven stages and consists of demultiplexers, circular buffers, ROM, complex multipliers, and butterfly units. BF1 denotes butterfly type 1, which consists of four 2-to-1 multiplexers and four adders. BF2 denotes butterfly type 2, which additionally swaps and inverts the real and imaginary parts to implement the (−j) multiplication required by the Radix 2² algorithm. First, the input data are streamed in and handled by the demultiplexer. These data are then processed in the even and odd parts of the architecture, where the dashed arrows show the flow of odd data and the solid lines show the even data. For each of the odd and even parts, a single-path delay feedback (SDF) pipeline structure [10] is used to process the data separately. There are three controllers, which produce the control signals and the addresses for reading the twiddle factors from the ROM. The even and odd parts of each stage share the same controller. There are five complex multiplications in the architecture. In the sixth stage, the even-part outputs need neither multiplication nor twiddle factor storage, as can be seen in Figure 3. The reason is that, after twiddle factor separation in this stage, all the twiddle factors in the even part become the constant 1. Therefore, no multiplication is required.

Fig. 3. The parallel Radix 2² based pipeline architecture

C. Implementation Level

As can be seen from Figure 3, there are seven stages. Based on the required control, it is advantageous to combine stages 1 and 2, stages 3 and 4, and stages 5 and 6 into three common controller blocks. These common controller blocks all have the structure shown in Figure 4. Therefore, the whole parallel architecture can be divided into the first three common controller blocks, the last block, and the arithmetic blocks. The arithmetic blocks are composed of five ROMs and complex multipliers.

Fig. 4. The common controller block

The basic idea of the data flow in these common controller blocks is that stage 1 repeats after calculating $\frac{N}{2^{2r}}$ data, and stage 2 repeats after calculating $\frac{N}{2^{2r+1}}$ data, where r (r = 1, 2, 3) is the index of the common controller block and N is the FFT size. Only one counter is used to produce control signals I and II for both stage 1 and stage 2. For the first common controller block, control signal I is first set to zero to let $\frac{N}{4}$ data be read into stage 1, and in the following $\frac{N}{4}$ cycles, control signal I is set to one to enable the butterfly function in stage 1. At the same time, stage 2 reads in the $\frac{N}{8}$ data outputs of stage 1, which is controlled by control signal II. In the next $\frac{N}{8}$ cycles, butterfly II of stage 2 works and control signal II equals one. The data flow analysis is shown in Figure 5.

Fig. 5. The operation modes of the block

The last block only includes the seventh stage. Because the odd and even data need to be commutated, two demultiplexers seem to be required to switch the data, as shown in Figure 3. However, this can be improved by analyzing the scheduling of the last stage. It can be found that only one butterfly is working per clock cycle, and that the first output of the even path is processed with the first output of the odd path of the 6th stage. As long as the timing is matched, the even path outputs are processed with the corresponding odd path outputs. Therefore, the two demultiplexers are not necessary and only one butterfly is required in the last stage. The modified structure of the last stage and its interface with the previous stage is shown in Figure 6.

IV. IMPLEMENTATION AND RESULT ANALYSIS

A. FPGA Implementation

The proposed design is synthesized and implemented with Xilinx ISE, targeting a Xilinx Virtex4 FPGA. The arithmetic blocks are directly mapped to


DSP48 components in Xilinx Virtex4. Table I shows the performance of the proposed implementation and the comparison with [7]. The table clearly shows the reduced resource count of the proposed design compared with the implementation in [7]. The reason is that the proposed design employs far fewer memory blocks and complex multipliers.

Fig. 6. The improved version of the 7th stage

TABLE I
THE COMPARISON WITH [7]

                                  [7]      proposed
Word length (bits)                11       10
Total Number Slice Registers      7390     717
*Number used as Flip Flops        3860     457
Total Number of 4 input LUTS      12749    2230
Number of DSP48s                  48       20

The word length used is lower than that of [7]. However, even when the word length of the proposed design is increased to 15, the total equivalent gate count is still much lower than that of [7]. At 15 bits, the numbers of slice registers, 4 input LUTs, and DSP48s of the proposed design are 1052, 3600, and 20 respectively.

B. ASIC targeted results

The proposed design is also synthesized with Synopsys Design Compiler, targeting ASIC implementation. The synthesis library is the Faraday 90 nm standard cell library [11], which is tailored for the UMC 90 nm logic LL-RVT (low-K) process. During the implementation stage of our processor, [8] was published, which employs a similar parallel structure. However, there are some key differences between the two architectures. The most important differences are in the first and last stages, where the proposed design reduces the number of shift registers and the latency of the processor. Table II shows the performance of the proposed implementation and the comparison with other state-of-the-art designs. The table shows that the number of gates used by the proposed design is only 55% of that of [8]. If 180 nm technology is scaled linearly to 90 nm, the area is reduced by a factor of 4. Hence, the design of [12] in 180 nm would correspond to an area of 616595.5 μm² in 90 nm technology, which is still much larger than the proposed design.

TABLE II
THE ASIC IMPLEMENTATION COMPARISON

                               proposed implementation   [8]               [12]
Technology                     90 nm, 1 V                0.18 μm, 1.8 V    0.18 μm, 1.8 V
Clock frequency (MHz)          264                       450               250
Parallel data format           2 data-path               2 data-path       4 data-path
Algorithm                      Radix 2²                  Radix 2⁴          Mixed Radix
Word length (bits)             10                        10                10
Complex multipliers            5                         2+0.41            2+2.48
Registers                      128                       190               -
Gates                          38540                     70000             -
Area (μm²)                     181140                    -                 2466382
Area (μm²) scaled for 90 nm    181140                    -                 616595.5

V. CONCLUSION

In this paper, a novel parallel pipeline FFT processor is designed for the ECMA-368 standard. Our architecture is based on a revised version of the Radix 2² algorithm. Our revision amounts to restructuring the associated signal flow graph into an even and an odd part. As such, it achieves not only the low multiplier count of the standard Radix 2² algorithm, but also a 50% reduction of the clock frequency and the lowest circular buffer count compared with traditional SDF architectures. Both FPGA and ASIC targeted synthesis results of this architecture are presented. The results show that the required area is dramatically reduced by the proposed design.

REFERENCES

[1] Intel, “Ultra-wideband (UWB) technology,” http://www.intel.com/technology/comms/uwb/.
[2] A. Batra et al., “Multi-band OFDM physical layer proposal for IEEE 802.15 Task Group 3a,” Tech. Rep., IEEE P.802.15-04/0493r0, 2004.
[3] Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard, 2nd Edition.
[4] A. Batra, J. Balakrishnan, G. Aiello, J. Foerster, and A. Dabak, “Design of a multiband OFDM system for realistic UWB channel environments,” Microwave Theory and Techniques, IEEE Transactions on, vol. 52, no. 9, pp. 2123–2138, Sept. 2004.
[5] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” Solid-State Circuits, IEEE Journal of, vol. 40, no. 8, pp. 1726–1735, Aug. 2005.
[6] R. Chidambaram, “A scalable and high-performance FFT processor, optimized for UWB-OFDM,” Master’s thesis, Delft University of Technology, 2005.
[7] N. Rodrigues, H. Neto, and H. Sarmento, “A OFDM module for a MB-OFDM receiver,” Design & Technology of Integrated Systems in Nanoscale Era, 2007. DTIS. International Conference on, pp. 25–29, Sept. 2007.
[8] J. Lee and H. Lee, “A High-Speed Two-Parallel Radix-2⁴ FFT/IFFT Processor for MB-OFDM UWB Systems,” IEICE Trans Fundamentals, vol. E91-A, no. 4, pp. 1206–1211, 2008.
[9] E. Saberinia, K. C. Chang, G. Sobelman, and A. H. Tewfik, “Implementation of a Multi-band Pulsed-OFDM Transceiver,” J. VLSI Signal Process. Syst., vol. 43, no. 1, pp. 73–88, 2006.
[10] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” Parallel Processing Symposium, 1996., Proceedings of IPPS ’96, The 10th International, pp. 766–770, Apr 1996.
[11] FARADAY, FSD0A A 90 nm Logic SP-RVT(Low-K) Process. FARADAY Technology Corporation, 2006.
[12] T. Chakraborty and S. Chakrabarti, “A reduced area 1 GSPS FFT design using MRMDF architecture for UWB communication,” in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on, 30 2008-Dec. 3 2008, pp. 1128–1131.


