
Synthesis of FPGA-Based FFT Implementations

This document proposes a systematic approach for synthesizing FPGA-based FFT implementations using two unrolling techniques: inner loop unrolling and outer loop unrolling. Inner loop unrolling realizes parallelism within each FFT stage by allocating multiple processing cores. Outer loop unrolling realizes pipelining by instantiating multiple processing cores across stages. The techniques are evaluated based on cost and performance metrics like usage of FPGA slices and block RAMs. Experimental results show combinations of the techniques can achieve cost-optimized FFT implementations for different performance levels.

Uploaded by

Pankaj Joshi
Copyright
© Attribution Non-Commercial (BY-NC)

SYNTHESIS OF FPGA-BASED FFT IMPLEMENTATIONS

Hojin Kee1, Newton Peterson2, Jacob Kornerup2, Shuvra S. Bhattacharyya1
1Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, USA.
2National Instruments Corporation, Austin, 78759, USA.

Overview
Propose a systematic approach for synthesizing field-programmable gate array (FPGA) implementations of fast Fourier transform (FFT) computations.
The proposed approach is composed of two orthogonal techniques, FFT inner loop unrolling and outer loop unrolling, to perform design space exploration in terms of cost and performance.
Achieve cost-optimized FFT implementations, subject to user-specified performance levels.
The proposed techniques can be retargeted to different kinds of FPGA devices.

Introduction
Fast Fourier transform (FFT) computation potentially requires multi-cycle processing blocks, as its computational complexity is O(N*logN), where N is the number of inputs.
Proposed approaches:
Outer loop unrolling: realizing pipelining by instantiating multiple processing cores across FFT butterfly stages.
Inner loop unrolling: realizing parallelism by allocating multiple cores within each stage.
Our synthesis approach is prototyped in National Instruments LabVIEW FPGA 8.5.
Cost metrics:
Usage of FPGA slices
Usage of block RAMs

Related Works
Ma [2] developed an efficient method for in-place memory management in FFT implementation, but this approach is restricted to a single butterfly unit.
Nordin et al. [4] presented a parameterized soft core generator for the FFT based on the Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [5].
Jackson et al. [6] proposed a systolic structure to provide high-throughput FFT implementation.
Distinguishing aspects of our approach:
Realization of data parallelism and pipelining with a carefully configured address generator.
No special permutation structures for butterfly operations.
Efficient utilization of FPGA slices subject to user-defined performance.

Unrolling techniques
A basic FFT core (BFC) provides dedicated hardware for one butterfly operation.
A K-times throughput improvement can be obtained by:
Running BFCs simultaneously across stages.
Incorporating parallelism inside the BFC within a given stage.

The two unrolling techniques have different cost functions in terms of usage of FPGA slices or BRAMs. The two approaches should be considered jointly for cost-efficient FPGA-based FFT implementation.
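To make the terms "outer loop" and "inner loop" concrete, the following is a minimal software sketch of a radix-2 FFT loop nest. It assumes a decimation-in-time formulation with bit-reversed input ordering, which the slides do not specify; the function names are illustrative only. The outer loop walks over the log2(N) butterfly stages, and the inner loop performs the N/2 butterflies of one stage, each of which corresponds to one BFC operation.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 FFT (assumed decimation-in-time form)."""
    n = len(x)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "length must be a power of two"
    data = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    for p in range(stages):          # outer loop: one iteration per stage
        half = 1 << p
        for j in range(n // 2):      # inner loop: one butterfly per iteration
            low = j & (half - 1)     # the p low-order bits of the butterfly number
            u = ((j >> p) << (p + 1)) | low   # input index with bit p = 0
            l = u | half                      # same index with bit p = 1
            w = cmath.exp(-2j * cmath.pi * low / (2 * half))
            t = w * data[l]
            data[u], data[l] = data[u] + t, data[u] - t
    return data
```

In this sketch, outer loop unrolling would assign different values of p to different cores, while inner loop unrolling would execute several values of j at once within the same stage.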

Outer Loop Unrolling


With an unrolling factor k > 1:
k BFCs are instantiated.
(k-1) of the BFCs take the same number of loop iterations each.
The last BFC takes the remaining loop iterations.

This approach introduces k identical copies of the sub-FFT core, so hardware cost is expected to increase by a factor of k. The trade-offs associated with outer loop unrolling are complemented by inner loop unrolling.
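One way to picture the distribution of stages across BFCs is sketched below. The split rule is an assumption, not taken from the slides: the first k-1 cores each receive ceil(stages / k) stages and the last core receives whatever is left. The function name `split_stages` is hypothetical.

```python
import math


def split_stages(num_stages, k):
    """Assign consecutive FFT stages to k pipelined cores.

    Assumed rule (not from the source): the first k-1 cores get
    ceil(num_stages / k) stages each, the last core gets the rest.
    """
    per_core = math.ceil(num_stages / k)
    last = num_stages - (k - 1) * per_core
    if last <= 0:
        raise ValueError("k is too large: the last core would be idle")
    assignment = []
    start = 0
    for core in range(k):
        count = per_core if core < k - 1 else last
        assignment.append(range(start, start + count))
        start += count
    return assignment
```

For a 1024-point FFT (10 stages) and k = 3, this rule gives the three cores 4, 4, and 2 stages respectively.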

Inner Loop Unrolling (Read)


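The indices of the two inputs, u and l, of a butterfly unit in the p-th stage are identical except for the p-th bit of their binary patterns.
Define two rotation functions: RL(x, n) rotates the bit pattern x left by n positions, and RR(x, n) rotates it right by n positions. CONCAT(x1, x2) concatenates two bit patterns.
Let x1 = 110 and x2 = 01100. Then RL(x2, 2) = 10001, RR(x1, 1) = 011, and CONCAT(x1, x2) = 11001100.
Read 2k inputs for k BFCs with a single address:
$A_p = a_{n-r-2} a_{n-r-3} \ldots a_0$ : address for all inputs.
$B^0_p = b_r b_{r-1} \ldots b_1 0$ : index of the 1st DM bank for a BFC.
$B^1_p = b_r b_{r-1} \ldots b_1 1$ : index of the 2nd DM bank for a BFC.

[Figure: a single address $A_p = a_{n-r-2} a_{n-r-3} \ldots a_0$ is applied to two BRAM banks with indices $b_r b_{r-1} \ldots b_1 0$ and $b_r b_{r-1} \ldots b_1 1$, whose outputs feed one BFC.]

$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \ldots a_p \, b_r b_{r-1} \ldots b_0 \, a_{p-1} a_{p-2} \ldots a_0 \tag{1}$$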
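The read mapping in (1) can be checked with a small software model. The sketch below treats bit patterns as integers with an explicit width; the function names, the argument order, and the restriction 0 <= p <= a_width are assumptions made for illustration and are not part of the original design.

```python
def rl(x, n, width):
    """Rotate the width-bit pattern x left by n positions."""
    n %= width
    mask = (1 << width) - 1
    return ((x << n) | (x >> (width - n))) & mask


def rr(x, n, width):
    """Rotate the width-bit pattern x right by n positions."""
    return rl(x, width - (n % width), width)


def concat(hi, lo, lo_width):
    """Place the pattern hi in front of the lo_width-bit pattern lo."""
    return (hi << lo_width) | lo


def butterfly_index(a, b, p, a_width, b_width):
    """Data index addressed by (A_p, B_p) in stage p, following (1).

    Assumes 0 <= p <= a_width.
    """
    word = concat(rr(a, p, a_width), b, b_width)
    return rl(word, p, a_width + b_width)


# The two banks of one BFC differ only in b_0, so the two indices
# differ only in bit p.
u = butterfly_index(0b101, 0b10, 1, a_width=3, b_width=2)
l = butterfly_index(0b101, 0b11, 1, a_width=3, b_width=2)
print(format(u, "05b"), format(l, "05b"))
```

With the 3-bit address 101, the 2-bit bank indices 10 and 11, and p = 1, the two indices come out as 10101 and 10111, which differ only in bit 1 as the text requires.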

Inner Loop Unrolling (Write)


Outputs in the p-th stage should be written to a DM bank so that they will be ready for the read in the (p+1)-th stage. The destination DM bank index and its associated address for writing butterfly output data can be generated by an inverse mapping of (1):
$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \ldots a_p \, b_r b_{r-1} \ldots b_0 \, a_{p-1} a_{p-2} \ldots a_0 = RL(CONCAT(RR(A_{p+1}, p+1), B_{p+1}), p+1)$$
$$A_{p+1} = a_{n-r-2} a_{n-r-3} \ldots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \ldots a_0$$
$$B_{p+1} = a_p b_r b_{r-1} \ldots b_1$$

Inner Loop Unrolling (Write) cont.


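[Figure: write path of one BFC. The BFC reads at address $a_{n-r-2} a_{n-r-3} \ldots a_0$ from the two BRAM banks $b_r b_{r-1} \ldots b_1 0$ = 1100 (bank 12) and $b_r b_{r-1} \ldots b_1 1$ = 1101 (bank 13), so $b_r b_{r-1} b_1$ = 110. Its outputs pass through a switch and registers, which form a simple interconnection network, to the destination BRAM with index $B_{p+1} = a_p b_r b_{r-1} \ldots b_1$ at the destination address $A_{p+1} = a_{n-r-2} a_{n-r-3} \ldots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \ldots a_0$. The destination bank is 0110 = 6 when $a_p$ = 0 and 1110 = 14 when $a_p$ = 1. Output addresses shown in the figure: 1010 and 1110.]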
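The write-side mapping can be modeled in a few lines. The sketch below is an illustration under assumptions: bit patterns are integers with explicit widths, 0 <= p < a_width, and the function names are hypothetical. It computes the destination address and bank from the current address and bank, and the helper `data_index` evaluates the right-hand side of (1) directly so that the mapping can be checked for consistency.

```python
def data_index(a, b, p, b_width):
    """Right-hand side of (1): a_high | b | a_low, with b_0 at bit p."""
    a_low = a & ((1 << p) - 1)
    a_high = a >> p
    return (a_high << (p + b_width)) | (b << p) | a_low


def next_stage_location(a, b, p, b_width):
    """Map (A_p, B_p) to (A_{p+1}, B_{p+1}). Assumes p < width of A."""
    a_p = (a >> p) & 1
    b_0 = b & 1
    a_next = (a & ~(1 << p)) | (b_0 << p)        # bit p of A becomes b_0
    b_next = (a_p << (b_width - 1)) | (b >> 1)   # a_p b_r ... b_1
    return a_next, b_next


# Bank numbers from the figure: banks 12 and 13 are read, and the
# destination is bank 6 when a_p = 0 and bank 14 when a_p = 1.
for bank in (0b1100, 0b1101):
    _, dest_when_0 = next_stage_location(0b000, bank, 1, b_width=4)
    _, dest_when_1 = next_stage_location(0b010, bank, 1, b_width=4)
    print(bank, dest_when_0, dest_when_1)
```

Because the data index of a value is the same whether it is expressed through (A_p, B_p) in stage p or through (A_{p+1}, B_{p+1}) in stage p+1, a value written this way sits where the next stage expects to read it.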

Cost/Performance Analysis
Cost model for outer loop unrolling and inner loop unrolling. We calibrate the model using synthesis results.
$$u_{inner} = s_{inner} \cdot u_{initial} (k_{inner} - 1) + u_{initial}$$
$$u_{outer} = s_{outer} \cdot u_{initial} (k_{outer} - 1) + u_{initial}$$
$u_{inner}$ / $u_{outer}$ : amount of utilization after inner/outer loop unrolling
$u_{initial}$ : amount of utilization without loop unrolling
$k_{inner}$ / $k_{outer}$ : unrolling factors
$s_{inner}$ / $s_{outer}$ : the slope of the linear plots from synthesis for inner (outer) loop unrolling

Combined analytic cost function.

$$u_{combined} = s_{outer} \cdot u_{inner} (k_{outer} - 1) + u_{inner}$$
$$k_{combined} = k_{outer} \cdot k_{inner}$$
$u_{combined}$ : amount of utilization after a combination of inner/outer loop unrolling
$k_{combined}$ : speedup resulting from such a combination
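The cost model lends itself to a simple design space search: enumerate every pair of unrolling factors whose product equals the target speedup and keep the cheapest one. The sketch below illustrates this for a single resource type. The function names are hypothetical, and the slope and utilization values in the usage line are placeholders, not calibrated synthesis results from this work.

```python
def utilization(slope, u_initial, k):
    """Linear cost model: u = slope * u_initial * (k - 1) + u_initial."""
    return slope * u_initial * (k - 1) + u_initial


def combined_utilization(s_inner, s_outer, u_initial, k_inner, k_outer):
    """Apply inner unrolling first, then outer unrolling on the result."""
    u_inner = utilization(s_inner, u_initial, k_inner)
    return utilization(s_outer, u_inner, k_outer)


def factor_pairs(target):
    """All (k_inner, k_outer) with k_inner * k_outer == target."""
    return [(ki, target // ki) for ki in range(1, target + 1) if target % ki == 0]


def cheapest_config(target, s_inner, s_outer, u_initial):
    """Pair of unrolling factors with the lowest modeled utilization."""
    return min(
        factor_pairs(target),
        key=lambda ks: combined_utilization(s_inner, s_outer, u_initial, *ks),
    )


# Placeholder parameters, chosen only to exercise the functions.
for k_inner, k_outer in factor_pairs(6):
    cost = combined_utilization(0.5, 0.9, 1000.0, k_inner, k_outer)
    print(k_inner, k_outer, cost)
```

In practice the search would be run once per resource type (slices and BRAMs), each with its own calibrated slopes, since the two techniques stress the two resources differently.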

Experimental Results
Figure 3 reports the FPGA resource utilization when the target speedup is 6. The configuration (kinner, kouter) = (3, 2) gives the lowest resource utilization at this target speedup, which matches the prediction of the analytic cost function. At the streaming FFT performance level, our approach requires 23% fewer FPGA slices than the Xilinx core, but 140% more BRAMs. At the sequential performance level, our approach requires 30% fewer slices and 17% more BRAMs.


Conclusion
Our approach incorporates efficient FFT address generation and memory management, and applies two orthogonal loop unrolling methods to provide a tunable trade-off between performance and FPGA resource costs. We also develop an analytical approach for high-level design space exploration, which allows one to estimate the most resource-efficient FFT architecture configuration for a given throughput constraint and a given critical target resource. A distinguishing characteristic of our approach, compared to commercially available FFT IP cores, is that we provide a systematic method to generate an FPGA-based FFT architecture while taking into account trade-offs between performance and cost.


References
[1] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.
[2] Y. Ma, "An Effective Memory Addressing Scheme for FFT Processors," IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 907-911, March 1999.
[3] W. Wolf, FPGA-Based System Design. Prentice Hall, 2004.
[4] G. Nordin, P. A. Milder, J. C. Hoe, and M. Puschel, "Automatic Generation of Customized Discrete Fourier Transform IPs," Design Automation Conference, pp. 471-474, 2005.
[5] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, "Multi-port interconnection networks for radix-r algorithms," in Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[6] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, "A Systolic FFT Architecture for Real Time FPGA Systems," High Performance Embedded Computing Workshop, 2004.
