This document proposes a systematic approach for synthesizing FPGA-based FFT implementations using two unrolling techniques: inner loop unrolling and outer loop unrolling. Inner loop unrolling realizes parallelism within each FFT stage by allocating multiple processing cores. Outer loop unrolling realizes pipelining by instantiating multiple processing cores across stages. The techniques are evaluated based on cost and performance metrics like usage of FPGA slices and block RAMs. Experimental results show combinations of the techniques can achieve cost-optimized FFT implementations for different performance levels.

SYNTHESIS OF FPGA-BASED FFT IMPLEMENTATIONS

Hojin Kee1, Newton Peterson2, Jacob Kornerup2, Shuvra S. Bhattacharyya1
1Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, USA.
2National Instruments Corporation, Austin, 78759, USA.

Overview
Propose a systematic approach for synthesizing field-programmable gate array (FPGA) implementations of fast Fourier transform computations.
The proposed approach is composed of two orthogonal techniques, FFT inner loop unrolling and outer loop unrolling, to perform design space exploration in terms of cost and performance.
Achieve cost-optimized FFT implementations, subject to user-specified performance levels.
The proposed techniques can be retargeted to different kinds of FPGA devices.

Introduction
Fast Fourier transform (FFT) computation potentially requires multi-cycle processing blocks, as its computational complexity is O(N log N), where N is the number of inputs.
Proposed approaches:
Outer loop unrolling: realizing pipelining by instantiating multiple processing cores across FFT butterfly stages.
Inner loop unrolling: realizing parallelism by allocating multiple cores within each stage.
Our synthesis approach is prototyped in National Instruments LabVIEW FPGA 8.5.
Cost metrics:
Usage of FPGA slices
Usage of Block RAMs

Related Works
Ma [2] developed an efficient method for in-place memory management in FFT implementation, but this approach is restricted to a single butterfly unit.
Nordin et al. [4] presented a parameterized soft core generator for the FFT based on the Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [5].
Jackson et al. [6] proposed a systolic structure to provide for high-throughput FFT implementation.
Distinguishing aspects of our approach:
Realization of data parallelism and pipelining with a carefully configured address generator.
No special permutation structures for butterfly operations.
Efficient utilization of FPGA slices subject to user-defined performance.

Unrolling techniques
A basic FFT core (BFC) provides dedicated hardware for one butterfly operation.
A k-times throughput improvement can be obtained in two ways:
Running BFCs simultaneously across stages (outer loop unrolling).
Incorporating parallelism inside the BFC within a given stage (inner loop unrolling).

The two unrolling techniques have different cost functions in terms of usage of FPGA slices or BRAMs. The two approaches should be considered jointly for cost-efficient FPGA-based FFT implementation.
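To make the terms "outer loop" and "inner loop" concrete, the following is a minimal software sketch of an iterative radix-2 FFT. It assumes a decimation-in-time formulation with bit-reversed input ordering, which the slides do not specify; the names `fft_radix2` and `bit_reverse` are illustrative only. The outer loop walks over the butterfly stages (the loop that outer loop unrolling spreads across BFCs), and the inner loop walks over the butterflies of one stage (the loop that inner loop unrolling parallelizes).

```python
import cmath


def bit_reverse(i, n_bits):
    """Reverse the lowest n_bits bits of the integer i."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 FFT (assumed decimation-in-time), N a power of two."""
    n = len(x)
    n_bits = n.bit_length() - 1
    assert n > 0 and (1 << n_bits) == n, "length must be a power of two"
    # Bit-reversed load so that stage p pairs indices differing in bit p.
    data = [complex(x[bit_reverse(i, n_bits)]) for i in range(n)]
    for p in range(n_bits):              # outer loop: one pass per stage
        half = 1 << p
        for j in range(n // 2):          # inner loop: one butterfly each
            group, offset = divmod(j, half)
            u = group * 2 * half + offset
            l = u + half                 # u and l differ only in bit p
            w = cmath.exp(-2j * cmath.pi * offset / (2 * half))
            t = w * data[l]
            data[u], data[l] = data[u] + t, data[u] - t
    return data
```

Each inner-loop iteration corresponds to one butterfly operation, that is, the work a single BFC performs.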

Outer Loop Unrolling

For an unrolling factor k > 1, outer loop unrolling instantiates k BFCs.
Each of the first (k-1) BFCs takes [iteration count missing in source] loop iterations.
The last BFC takes [iteration count missing in source] loop iterations.

This approach introduces k identical copies of the sub-FFT core, so hardware cost is expected to increase by a factor of k. The trade-offs associated with outer loop unrolling are complemented by inner loop unrolling.

Inner Loop Unrolling (Read)

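Indices of the two inputs, u and l, of a butterfly unit in the p-th stage are identical except for the p-th bit of their binary patterns.
Define two rotation functions, RL (rotate left) and RR (rotate right), together with bit-string concatenation, CONCAT.
Let $x_1 = 110$ and $x_2 = 01100$. Then $RL(x_2, 2) = 10001$, $RR(x_1, 1) = 011$, and $CONCAT(x_1, x_2) = 11001100$.
Read 2k inputs for k BFCs with a single address:
$A_p = a_{n-r-2} a_{n-r-3} \cdots a_0$ : address for all inputs.
$B^0_p = b_r b_{r-1} \cdots b_1 0$ : index of the 1st DM bank for a BFC.
$B^1_p = b_r b_{r-1} \cdots b_1 1$ : index of the 2nd DM bank for a BFC.

[Figure: a BFC reading its two inputs from the banks $b_r b_{r-1} \cdots b_1 0$ and $b_r b_{r-1} \cdots b_1 1$ at the shared address $a_{n-r-2} a_{n-r-3} \cdots a_0$.]

$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \cdots a_p \, b_r b_{r-1} \cdots b_0 \, a_{p-1} a_{p-2} \cdots a_0 \qquad (1)$$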
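The read mapping in (1) can be sketched in software with bit strings written most significant bit first. This is only an illustration under that representation choice; the function names `rl`, `rr`, `concat`, and `input_index` are hypothetical and do not come from the LabVIEW FPGA prototype.

```python
def rl(bits, s):
    """Rotate the non-empty bit string `bits` left by s positions."""
    s %= len(bits)
    return bits[s:] + bits[:s]


def rr(bits, s):
    """Rotate the non-empty bit string `bits` right by s positions."""
    s %= len(bits)
    return bits[len(bits) - s:] + bits[:len(bits) - s]


def concat(x1, x2):
    """Concatenate two bit strings, x1 in the high-order position."""
    return x1 + x2


def input_index(addr, bank, p):
    """Equation (1): butterfly input index for stage p.

    addr is A_p and bank is B_p (ending in 0 for input u, 1 for input l).
    """
    return rl(concat(rr(addr, p), bank), p)


# Worked example from the slide.
print(rl("01100", 2), rr("110", 1), concat("110", "01100"))

# Assumed toy configuration: 4 address bits, 2 bank bits, stage p = 1.
u = input_index("1001", "10", 1)
l = input_index("1001", "11", 1)
print(u, l)
```

In the toy configuration the two indices come out as 100101 and 100111, which differ only in bit p = 1 (counting from the least significant bit), as the first sentence of this section requires.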

Inner Loop Unrolling (Write)

Outputs of the p-th stage should be written to a DM bank so that they will be ready for the read in the (p+1)-th stage. The destined DM bank index and its associated address for writing butterfly output data can be generated by an inverse mapping of (1).
$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \cdots a_p \, b_r b_{r-1} \cdots b_0 \, a_{p-1} a_{p-2} \cdots a_0 = RL(CONCAT(RR(A_{p+1}, p+1), B_{p+1}), p+1)$$
$$A_{p+1} = a_{n-r-2} a_{n-r-3} \cdots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \cdots a_0$$
$$B_{p+1} = a_p \, b_r b_{r-1} \cdots b_1$$
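The inverse mapping can be checked with a short sketch, again using most-significant-bit-first bit strings. The function `next_stage_location` is a hypothetical name; it simply swaps bit $a_p$ of the address with bit $b_0$ of the bank index as the two expressions for $A_{p+1}$ and $B_{p+1}$ prescribe, and the loop at the end confirms that the data index is unchanged when it is read back in stage p+1.

```python
def rl(bits, s):
    """Rotate a non-empty bit string left by s positions."""
    s %= len(bits)
    return bits[s:] + bits[:s]


def rr(bits, s):
    """Rotate a non-empty bit string right by s positions."""
    s %= len(bits)
    return bits[len(bits) - s:] + bits[:len(bits) - s]


def input_index(addr, bank, p):
    """Equation (1): RL(CONCAT(RR(A_p, p), B_p), p)."""
    return rl(rr(addr, p) + bank, p)


def next_stage_location(addr, bank, p):
    """Return (A_{p+1}, B_{p+1}) for data read at (A_p, B_p) in stage p.

    Requires 0 <= p < len(addr).
    """
    m = len(addr)
    a_p = addr[m - 1 - p]          # bit a_p of the address
    b_0 = bank[-1]                 # least significant bank bit
    next_addr = addr[:m - 1 - p] + b_0 + addr[m - p:]
    next_bank = a_p + bank[:-1]
    return next_addr, next_bank


# Consistency check on an assumed toy configuration (4 address bits).
for bank in ("10", "11"):
    addr, p = "1001", 1
    nxt_addr, nxt_bank = next_stage_location(addr, bank, p)
    same = input_index(addr, bank, p) == input_index(nxt_addr, nxt_bank, p + 1)
    print(bank, nxt_addr, nxt_bank, same)
```

Because only one address bit and one bank bit trade places, the write side needs no general permutation network, which is consistent with the "simple interconnection network" named on the slides.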

Inner Loop Unrolling (Write) cont.

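[Figure: write path of a BFC. The two outputs pass through a switch and output registers (a simple interconnection network) before being written to the destined banks. The read address is $a_{n-r-2} a_{n-r-3} \cdots a_0$; the example output addresses shown are 1010 and 1110.]

Destined BRAM index: $B_{p+1} = a_p \, b_r b_{r-1} \cdots b_1$
Destined address: $A_{p+1} = a_{n-r-2} a_{n-r-3} \cdots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \cdots a_0$

Example from the figure: the BFC reads from banks $b_r b_{r-1} b_1 0 = 1100 = 12$ and $b_r b_{r-1} b_1 1 = 1101 = 13$, so $b_r b_{r-1} b_1 = 110$. Its outputs are written to bank $0110 = 6$ when $a_p = 0$ and to bank $1110 = 14$ when $a_p = 1$.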

Cost/Performance Analysis
Cost model for outer loop unrolling and inner loop unrolling. We calibrate the model using synthesis results.
$$u_{inner} = s_{inner} \cdot u_{initial} \cdot (k_{inner} - 1) + u_{initial}$$
$$u_{outer} = s_{outer} \cdot u_{initial} \cdot (k_{outer} - 1) + u_{initial}$$
$u_{inner}$ / $u_{outer}$ : amount of utilization after inner/outer loop unrolling
$u_{initial}$ : amount of utilization without loop unrolling
$k_{inner}$ / $k_{outer}$ : unrolling factors
$s_{inner}$ / $s_{outer}$ : slope of the linear plots from synthesis for inner (outer) loop unrolling

Combined analytic cost function.

$$u_{combined} = s_{outer} \cdot u_{inner} \cdot (k_{outer} - 1) + u_{inner}$$
$$k_{combined} = k_{outer} \cdot k_{inner}$$
$u_{combined}$ : amount of utilization after a combination of inner/outer loop unrolling
$k_{combined}$ : speedup resulting from such a combination
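The cost model lends itself to a small design-space search: enumerate every factor pair whose product equals the target speedup and keep the pair with the lowest predicted utilization. The sketch below follows the formulas above; the function names and all numeric values in the usage lines (baseline utilization and slopes) are hypothetical placeholders, not calibrated synthesis results.

```python
def utilization(u_base, slope, k):
    """Linear cost model: slope * u_base * (k - 1) + u_base."""
    return slope * u_base * (k - 1) + u_base


def combined_utilization(u_initial, s_inner, s_outer, k_inner, k_outer):
    """Apply inner unrolling first, then outer unrolling on top of it."""
    u_inner = utilization(u_initial, s_inner, k_inner)
    return utilization(u_inner, s_outer, k_outer)


def best_configuration(target_speedup, u_initial, s_inner, s_outer):
    """Return the (k_inner, k_outer) pair with k_inner * k_outer equal to
    the target speedup that minimizes the predicted utilization."""
    candidates = [
        (k_inner, target_speedup // k_inner)
        for k_inner in range(1, target_speedup + 1)
        if target_speedup % k_inner == 0
    ]
    return min(
        candidates,
        key=lambda c: combined_utilization(u_initial, s_inner, s_outer, *c),
    )


# Hypothetical calibration values, for illustration only.
U_INITIAL, S_INNER, S_OUTER = 1000.0, 0.5, 1.0
for k_in, k_out in [(1, 6), (2, 3), (3, 2), (6, 1)]:
    print((k_in, k_out),
          combined_utilization(U_INITIAL, S_INNER, S_OUTER, k_in, k_out))
print(best_configuration(6, U_INITIAL, S_INNER, S_OUTER))
```

The model is evaluated per resource, so the preferred pair can differ between slices and BRAMs; with calibrated slopes the search would be run against whichever resource is critical for the target device.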

Experimental Results
Figure 3 reports the FPGA resource utilization when the target speedup is 6. The configuration (kinner, kouter) = (3, 2) gives the best utilization at this target speedup, which matches the prediction of the analytic cost function. At the streaming FFT performance level, our approach requires 23% fewer FPGA slices than the Xilinx core, but 140% more BRAMs. At the sequential performance level, our approach requires 30% fewer slices and 17% more BRAMs.

Conclusion
Our approach incorporates efficient FFT address generation and memory management, and applies two orthogonal loop unrolling methods to provide a tunable trade-off between performance and FPGA resource costs. We also develop an analytical approach for high-level design space exploration, which allows one to estimate the most resource-efficient FFT architecture configuration for a given throughput constraint and a given critical target resource. A distinguishing characteristic of our approach, compared to commercially available FFT IP cores, is that we provide a systematic method to generate an FPGA-based FFT architecture while taking into account trade-offs between performance and cost.

References
[1] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, Vol. 19, No. 90, pp. 297-301, 1965.
[2] Y. Ma, An Effective Memory Addressing Scheme for FFT Processors, IEEE Transactions on Signal Processing, Vol. 47, Issue 3, pp. 907-911, March 1999.
[3] W. Wolf, FPGA-Based System Design, Prentice Hall, 2004.
[4] G. Nordin, P. A. Milder, J. C. Hoe, M. Puschel, Automatic Generation of Customized Discrete Fourier Transform IPs, Design Automation Conference, pp. 471-474, 2005.
[5] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, Multi-port interconnection networks for radix-r algorithms, In Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[6] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, A Systolic FFT Architecture for Real Time FPGA Systems, High Performance Embedded Computing Workshop, 2004.
