Implementation of Fast Fourier Transform (FFT) On Graphics Processing Unit (GPU)
September, 2010
ABSTRACT
The Fourier Transform is a widely used tool in many scientific and engineering fields such as digital signal processing. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently computing the Discrete Fourier Transform (DFT). The FFT is computationally intensive, and real time implementation of the FFT on a general purpose CPU is challenging due to limited processing power.
Graphics Processing Units (GPUs) are an emerging breed of massively parallel processors with hundreds of processing cores, in contrast to the few cores of a CPU. Their greater computational power and parallel architecture allow GPUs to outperform CPUs on data-parallel compute applications by a large factor. The growing computational power of GPUs has introduced the concept of General Purpose Computing on GPUs (GPGPU).
Open Computing Language (OpenCL) is a developing, royalty-free standard for cross-
platform general-purpose parallel programming. OpenCL provides a uniform
programming environment for developing efficient and portable software for multi-core
CPUs and GPUs.
The aim of this project is to implement the Radix-2 FFT algorithm on a state-of-the-art AMD GPU, the ATI Radeon 5870, using the OpenCL programming language. 1D and 2D FFT algorithms were successfully implemented with significant performance gains.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
DEDICATION
CHAPTER 1
PROJECT INTRODUCTION
Project Title
Project Introduction
CHAPTER 2
LITERATURE REVIEW
FFT on Graphics Hardware
CUDA FFT Library (CUFFT)
Apple FFT Library
IPT ATI Project
Project Motivation
CHAPTER 3
GRAPHICS PROCESSING UNIT (GPU)
Introduction
Major GPU Vendors
Evolution of GPUs
GPU Capabilities
General Purpose GPU (GPGPU)
CPU Vs GPU
GPU Application Areas
CHAPTER 4
GPU ARCHITECTURE
Flynn's Taxonomy
Single Instruction Multiple Data (SIMD) Architecture
Generalized GPU Architecture
ATI Radeon 5870 Architecture
Memory Hierarchy of ATI 5870
CHAPTER 5
GPGPU PROGRAMMING ENVIRONMENT
Introduction
Compute Unified Device Architecture (CUDA)
Open Computing Language (OpenCL)
Anatomy of OpenCL
OpenCL Architecture
OpenCL Execution Model
OpenCL Memory Model
OpenCL Program Structure
CHAPTER 6
FAST FOURIER TRANSFORM (FFT)
Introduction
Fourier Transform
Categories of Fourier Transform
Discrete Fourier Transform (DFT)
Fast Fourier Transform (FFT)
FFT Algorithms
Radix-2 FFT Algorithm
Decomposition of Time Domain Signal
Calculating Frequency Spectra
Frequency Spectrum Synthesis
Reducing Operations Count
CHAPTER 7
OPENCL IMPLEMENTATION
Introduction
Data Packaging
1D FFT Implementation
Data Decomposition
Parallel Implementation of Elster Algorithm
Butterfly Computations
Improved Program Structure
2D FFT Implementation
Matrix Transpose Implementation
Matrix Transpose Using Local Memory
CHAPTER 8
RESULTS AND CONCLUSION
Computing Environment
Experiment Setup
1D FFT Results
2D FFT Results
Analysis of Results
Conclusion
REFERENCES
LIST OF FIGURES
DEDICATION
I dedicate this report to my parents who have been a source of inspiration for me; to my
sisters who made me believe in myself when I thought that I could not make it; and
finally to my friends just for being there.
ACKNOWLEDGEMENT
I am grateful to my advisor Sqn Ldr Tauseef ur Rehman for all the guidance and support. I would also like to thank the Department of Avionics Engineering for providing a conducive environment for research and study.
CHAPTER 1
PROJECT INTRODUCTION
Project Title
Project Introduction
2. The Fourier Transform is a well known and widely used tool in many scientific and engineering fields. It converts a signal from the time domain to the frequency domain and is essential for many image processing techniques, including filtering, convolution, manipulation, correlation, and compression.
3. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently
computing the Discrete Fourier Transform (DFT). FFT is a computationally intensive
algorithm and real time implementation of FFT on a general purpose CPU is challenging
due to limited processing power.
5. NVIDIA and AMD are the leading manufacturers of GPUs. NVIDIA has introduced a GPU development platform, Compute Unified Device Architecture (CUDA), for general purpose computing on GPUs. CUDA is not a cross-platform tool and its use is limited to NVIDIA GPUs only [2]. NVIDIA has also developed a GPU accelerated FFT library called CUFFT; AMD GPUs cannot take advantage of this library.
7. The aim of this project is to implement the Radix-2 FFT algorithm on an AMD GPU using OpenCL as the programming language. A state-of-the-art AMD GPU, the ATI Radeon 5870, will be used for the implementation. The ATI Radeon 5870 can potentially deliver a peak computational power of 2.72 Tera floating point operations per second (FLOPS).
8. The focus of the project is to exploit the data parallelism inherent in the FFT algorithm and to develop code that exploits AMD's GPU architecture by fully utilizing its parallel computing resources. The implementation will be benchmarked against state-of-the-art Intel CPUs.
CHAPTER 2
LITERATURE REVIEW
2. With the introduction of the GPGPU concept in 2002, GPUs became an attractive target for computation because of their high performance and low cost compared to parallel vector machines and CPUs. At that time, general purpose algorithms for the GPU had to be mapped to the programming model provided by graphics APIs such as OpenGL and DirectX. The graphics APIs could not fully utilize the compute resources because access to low-level hardware features was not supported. Using graphics APIs for GPGPU was therefore challenging, and the performance gains were small compared to the programming effort.
3. K. Moreland and E. Angel [4] were among the first to implement the FFT on graphics hardware, in 2003, using graphics APIs. The implementation was done on an NVIDIA GeForce FX 5800 Ultra graphics card, which features a fully programmable pipeline with 32 bit floating point math throughout. The programming environment was OpenGL with the Cg language and its runtime libraries. The average performance achieved was 2.5 Giga FLOPS. J. Spitzer implemented the FFT on the same NVIDIA GeForce FX 5800 Ultra card in 2003 [5], also using the graphics APIs, and reported a peak performance of 5 Giga FLOPS.
4. These FFT implementations on GPUs revealed that using graphics APIs for GPGPU is inefficient and that the achievable peak performance was very limited compared to the programming effort. Developers responded with non-graphics APIs that fully utilize the compute resources of the GPU while reducing the programming effort.
5. NVIDIA launched CUDA in 2007 [6], allowing developers to fully utilize the immense GPU power by accessing all hardware features and resources through C, an industry-standard high-level language. With CUDA, the programming effort was reduced and the performance gains were much more significant. In February 2007, NVIDIA launched the first GPU accelerated FFT library, the CUDA FFT Library (CUFFT) [7].
6. CUFFT is the first GPU accelerated FFT library. The initial version was released in February 2007, and the latest release, CUFFT version 3.0, was launched in February 2010 [8]. The salient features of this library include:
(d) 2D and 3D transform sizes in the range [2, 16384] in any dimension
(e) In-place and out-of-place transforms for real and complex data
9. In February 2010, Apple Inc. published an FFT library for the Mac OS X implementation of OpenCL [9]. This FFT library includes all the features of the CUFFT library. Its runtime requirements are Mac OS X v10.6 or later with OpenCL 1.0 support and Apple's Xcode compiler, which limits the use of the library to Apple computers.
10. The OpenCL developer community has modified the library to make it compatible with AMD's OpenCL implementation. The largest transform size achieved in this way is 1024 points, with reported numerical accuracy issues.
11. In February 2010, Jingfei Kong published an OpenCL implementation of the FFT for AMD ATI GPUs. This implementation accelerates the MATLAB FFT using the MATLAB external interface (MEX) [10] and only supports 1D transforms in single precision.
Project Motivation
12. This project is a step towards the development of an OpenCL FFT library for AMD ATI GPUs that is comparable to the CUFFT library in features and performance. The contribution of this project is the implementation of 1D transforms, batched 1D transforms and 2D transforms.
CHAPTER 3
GRAPHICS PROCESSING UNIT (GPU)
Introduction
2. Graphics Card. The graphics card is a peripheral device that interfaces with the motherboard by means of an expansion slot such as a Peripheral Component Interconnect Express (PCIe) slot or an Accelerated Graphics Port (AGP). Fig. 3-1 shows the ATI 5870 graphics card.
3. The key components of a graphics card are the GPU, the video memory, the output interface and the motherboard interface. Fig. 3-2 depicts the layout of the ATI 5870 graphics card [11].
4. Graphics cards offer added functions, such as video capture and decoding, TV
output, or the ability to connect multiple monitors. High performance graphics cards are
used for more graphically demanding purposes, such as PC games.
5. In 2008, Intel, NVIDIA and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% market share respectively. However, these numbers count Intel's very low-cost, less powerful integrated graphics solutions as GPUs [12].
6. In June 2010, a 30-day GPU market share survey was carried out by PassMark [13]. The survey covered dedicated graphics cards only. According to this survey, NVIDIA and AMD/ATI are the leading GPU manufacturers with 49% and 34% market share respectively, while Intel captures 10% of the market. Fig. 3-3 shows the results of this survey.
Evolution of GPUs
7. The history of graphics processors traces back to the 1980s, when 2D graphics and text displays were generated by graphics chips called accelerators.
8. The IBM Professional Graphics Controller, released in 1984, was one of the very first 2D graphics accelerators available for the IBM PC [12]. Its high price, slow processor and lack of compatibility with commercial programs prevented it from succeeding in the mass market.
10. In 1997, 3D accelerators added another significant hardware stage to the 3D graphics pipeline: hardware transform and lighting. The NVIDIA GeForce 256 (NV10), released in 1999, was the first card on the market with this capability [12].
11. NVIDIA was the first to produce a chip with a programmable graphics pipeline, the GeForce 3 (NV20), in 2000. By October 2002, the ATI Radeon 9700 (R300), the world's first Direct3D 9.0 accelerator, had added floating point math capability to GPUs [12]. Since then GPUs have quickly become almost as flexible as CPUs, and orders of magnitude faster for image-array operations.
GPU Capabilities
12. Historically, GPUs have been used primarily for graphics processing and gaming. As graphics processing is inherently a parallel task, GPUs naturally have a more parallel architecture than standard CPUs. Furthermore, 3D video games demand very high computational power, driving GPU development beyond CPUs. Modern GPUs therefore offer, in comparison to CPUs, extremely high performance for the monetary cost.
13. Modern GPUs have a more flexible and programmable graphic pipeline and offer
high peak performance. Naturally, interest has been developed as to whether the GPU
processing power can be harnessed for more general purpose calculations.
14. The addition of programmable stages and higher precision arithmetic to the graphics pipeline allows software developers to use GPUs for processing non-graphics data. The idea of utilizing the parallel computing resources of a GPU for non-graphics, general purpose computations is known as General-Purpose computation on Graphics Processing Units (GPGPU). The term GPGPU was coined by Mark Harris in 2002 when he recognized an early trend of using GPUs for non-graphics applications [1].
15. GPGPU is the technique of using a GPU, which typically handles computation
only for computer graphics, to perform computation in applications traditionally handled
by the CPU. Once specially designed for computer graphics and difficult to program,
today’s GPUs are general-purpose parallel processors with support for accessible
programming interfaces and industry-standard languages such as C. Porting applications to GPUs often achieves speedups of orders of magnitude over optimized CPU implementations.
CPU Vs GPU
17. CPU performance tops out at about 25 Giga FLOPS per core; thus a Core i7 with 4 cores delivers a peak performance of roughly 110 Giga FLOPS [15], whereas the ATI Radeon 5870 delivers 2.72 Tera FLOPS and the NVIDIA GTX 285 delivers 1 Tera FLOPS. This huge performance advantage of GPUs justifies GPGPU.
19. The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, a GPU typically contains hundreds of basic cores. Each CPU core is capable of running a heavy task independently, so multiple tasks can map to different cores. A GPU core is a very basic processing element, and the cores perform the same operation on different data simultaneously.
20. CPUs use a large chip area for control circuitry and data caching for faster data accesses. GPUs, on the other hand, devote very little area to caches and control circuitry; most of the die area is occupied by ALUs. GPUs thus gain their performance advantage by allocating a huge number of transistors to floating point calculations.
21. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The dynamic RAM used in modern GPUs is GDDR5, which has a much greater bandwidth than the DDR2/DDR3 DRAM generally found in consumer PCs. CPUs typically operate at a 2-3 GHz clock frequency; the GPU clock frequency is lower, typically up to 1.2 GHz, but this gap has been closing over the last few years.
22. GPUs offer very high peak performance, but not all consumer applications can take full advantage of it. As GPUs are massively parallel devices, an application needs to be highly parallel to fully utilize the GPU resources. Applications such as graphics processing are highly parallel in nature and can keep the cores busy, resulting in a significant performance improvement over a standard CPU.
23. For applications less susceptible to such high levels of parallelization, the extent
to which the available performance can be harnessed will depend on the nature of the
application and the investment put into software development.
24. Following are the major application areas where GPUs provide significant
speedups over standard CPUs.
CHAPTER 4
GPU ARCHITECTURE
Flynn's Taxonomy
3. Fig. 4-2 [18] shows a simplified diagram of a generalized GPU device. A GPU device comprises a set of compute units. Each compute unit has a set of stream cores, which are further divided into basic execution units called processing elements. All ATI GPUs follow a similar design pattern; however, the number of compute units or stream cores may vary from device to device.
4. The ATI Radeon 5870 architecture is given the code name Cypress [19]. Fig. 4-3 illustrates the Cypress architecture. The GPU comprises 20 compute units which operate as SIMD engines. Each SIMD engine consists of 16 stream cores, and each stream core houses 5 processing elements.
5. SIMD Engine. Each compute unit consists of 16 stream cores and operates
as a SIMD engine. All stream cores within a SIMD engine have to execute the same
instruction sequence; different compute units may execute different instructions. Fig. 4-4
[19] shows the internal design of a compute unit.
8. GPUs generally feature a multi-level memory space for efficient data access and
communication within the GPU. Fig. 4-6 [19] illustrates the memory spaces on ATI
5870.
9. Global Memory. Global memory is the main memory pool on the GPU and is accessible by all processing elements on the GPU for read and write operations. The ATI 5870 features 1 GB of GDDR5 global memory operating at 1.2 GHz. The data transfer rate is 4.8 Gbps per pin over a 256 bit wide memory bus, giving a memory bandwidth of 153.6 GB/s. This is the slowest memory on the GPU.
10. Constant Data Cache. The ATI 5870 GPU features 48 KB of constant data cache used to store frequently used constant values. This memory is written by the host, and all processing elements have read-only access to it. The constant cache is a very fast memory with 4.25 TB/s of bandwidth.
11. Local Data Share (LDS). Each SIMD engine has 32 KB local memory called
LDS. All processing elements within a SIMD can share data using this memory. LDS
offers 2.125 TB/s memory bandwidth providing low latency data access to each SIMD
engine. LDS is arranged into 32 banks, each with 1KB memory. LDS provides zero
latency reads in broadcast mode and in conflict free reads/writes.
12. Registers. Registers are the fastest memory available on the GPU. Each SIMD engine possesses 256 KB of register file, providing 13 TB/s of bandwidth. Registers are local to each processing element.
13. Global Data Share. The ATI 5870 also features a low latency global data share (GDS) that allows all processing elements to share data. This memory space is not available on NVIDIA GPUs or older ATI GPUs. The size of the GDS is 64 KB and its access latency is only 25 clock cycles.
CHAPTER 5
GPGPU PROGRAMMING ENVIRONMENT
Introduction
2. Over time, developers and GPU vendors evolved shaders from simple assembly
language programs into high-level programs that create the amazingly rich scenes
found in today’s 3D software. To handle increasing shader complexity, the pixel
processing elements were redesigned to support more generalized math, logic, and flow
control operations. This set the stage for a new way to accelerate computation, the
GPGPU.
3. GPU vendors and software developers realized that the trends in GPU designs
offered an incredible opportunity to take the GPU beyond graphics. All that was needed
was a non-graphic Application Program Interface (API) that could engage the emerging
programmable aspects of the GPU and access its immense power for non graphics
applications.
OpenCL is backed by industry leaders including Apple, AMD, Intel, ARM, Texas Instruments and many others. The OpenCL specifications are managed by the Khronos Group [3].
OpenCL gives developers a uniform programming environment for writing efficient, portable code for heterogeneous compute devices such as multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs [3]. OpenCL will form the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware and applications.
10. OpenCL is being created by the Khronos Group with the participation of many industry-leading companies and institutions including AMD, Apple, ARM, Electronic Arts, Ericsson, IBM, Intel, Nokia, NVIDIA and Texas Instruments.
11. The OpenCL language is based on C99 and is used to write programs that execute on OpenCL devices; APIs are provided to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. Its architecture shares a range of computational interfaces with two competitors, NVIDIA's CUDA and Microsoft's DirectCompute.
Anatomy of OpenCL
13. The Khronos Group released the first OpenCL specification in 2008 [21]. The OpenCL 1.0 specification is made up of three main parts: the language specification, the platform layer API and the runtime API.
15. Platform Layer API. The platform layer API gives the developer access to
routines that query for the number and types of devices in the system. The developer
can then select and initialize the necessary compute devices to properly run their work
load. The required compute resources for job submission and data transfer are created
at this layer.
16. Runtime API. The runtime API allows the developer to queue up the work
for execution and is responsible for managing the compute and memory resources in
the OpenCL system.
OpenCL Architecture
17. OpenCL Platform Model. The OpenCL platform consists of a host and one or more compute devices. The host is the CPU running the main operating system, and a compute device is any CPU or GPU that provides processing power for OpenCL. OpenCL allows multiple heterogeneous compute devices to connect to a single host and divides the work efficiently among them.
19. Since OpenCL is meant to target not only GPUs but also other accelerators, such
as multi-core CPUs, flexibility is given in specifying the type of task; whether it is data-
parallel or task-parallel. The OpenCL execution model includes Compute Kernels and
Compute Programs.
20. Compute Kernel. A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Execution of kernels can proceed either in-order or out-of-order depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check the status of outstanding kernel execution requests and other runtime requests.
Each point in a kernel's execution domain is a work-item, and OpenCL provides the ability to group work-items into work-groups for synchronization and communication purposes.
23. OpenCL defines a multi-level memory model, ranging from private memory visible only to an individual work-item up to global memory that is visible to all compute units on the device. Depending on the actual memory subsystem, different memory spaces are allowed to be collapsed together.
24. OpenCL 1.0 defines 4 memory spaces: private, local, constant and global [22].
Fig. 5-2 shows a diagram of the memory hierarchy defined by OpenCL.
25. Private Memory. Private memory can only be used by a single processing element (work-item); no two processing elements can access each other's private memory. It is similar to the registers in a single CPU core. This is the fastest memory available on the GPU, but its size is very limited, generally a few kilobytes.
26. Local Memory. Local memory can be used by the work-items within a work-group. All work-items within a work-group can share data through it, but data cannot be shared between different work-groups. Physically, local memory is the local data share (LDS) available on the current generation of GPUs; each compute unit has its own local memory shared among all the processing elements in that compute unit. Local memory is also extremely fast, although slower than private memory. The size of the LDS ranges from 16 KB to a maximum of 48 KB on the latest hardware.
28. Global Memory. Global memory can be used by all the compute units on the device. It corresponds to the off-chip DRAM available on graphics cards. On the latest GPUs, GDDR5 DRAM is used as global memory and operates at a bandwidth of 100 GB/s and above.
29. An OpenCL program consists of host code that executes on the host processor and kernel code that executes on the compute device. A CPU can be used both as the host and as a compute device.
30. OpenCL Host Code. Host code is a C/C++ program executing on the host processor to augment the kernel code [22]. A typical host code includes the following steps; a minimal sketch follows the list.
(c) Create a command queue to accept the execution and memory requests
(d) Allocate OpenCL memory to hold the data for the compute kernel
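A minimal host-code sketch covering these steps is given below. It is illustrative only: error checking is omitted, and the names run_kernel, kernel_src and "fft_stage" are placeholders rather than identifiers from this project.

/* Illustrative OpenCL 1.0 host code: select a device, create a context and
   command queue, allocate a buffer, build a program and run one kernel. */
#include <CL/cl.h>

void run_kernel(const char* kernel_src, float* host_data, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Allocate device memory and copy the input data to it. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_data, 0, NULL, NULL);

    /* Build the kernel program and set its arguments. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "fft_stage", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Enqueue the kernel over n work-items and read the results back. */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_data, 0, NULL, NULL);
}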
CHAPTER 6
FAST FOURIER TRANSFORM (FFT)
Introduction
1. The Fourier Transform is a well known and widely used tool in many scientific and engineering fields. The Fourier transform is essential for many image processing techniques, including filtering, manipulation, correction, and compression. Fourier analysis is a family of mathematical techniques, all based on decomposing signals into sinusoids. The discrete Fourier transform (DFT) is the family member used with digitized signals and forms the basis of digital signal processing. The FFT is an efficient algorithm to compute the DFT and its inverse. This chapter provides a brief description of Fourier analysis and its applications, followed by a detailed description of the DFT and FFT.
Fourier Transform
3. The Fourier transform converts a signal from the time domain to the frequency domain by representing it as a sum of properly chosen sinusoids. A time domain signal is defined by amplitude values at specific time intervals, whereas a frequency domain signal is defined by the amplitudes and phase shifts of the various sinusoids that make up the signal.
Sinusoids are used because of sinusoidal fidelity: a sinusoidal input to a linear system is guaranteed to produce a sinusoidal output, at exactly the same frequency and shape as the input. Only the amplitude and phase can change [23].
5. The general term Fourier transform can be broken into four categories, resulting from the four basic types of signals that can be encountered [23].
6. DFT is one of the most important algorithms in Digital Signal Processing (DSP). It
converts a periodic-discrete time domain signal to periodic-discrete frequency domain
signal. As digital computers can only work with information that is discrete and finite in
length, the only Fourier transform that can be used in DSP is the DFT.
7. The input to the DFT is a finite sequence of real or complex numbers, making the
DFT ideal for processing information stored in computers. In particular, the DFT is
widely employed in signal processing and related fields to analyze the frequencies
contained in a sampled signal, to solve partial differential equations, and to perform
other operations such as convolutions or multiplying large integers.
X_k = \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi k n / N}    Eq. (6.1)
where N is the total number of input points (samples), X_k represents the DFT of the time domain signal x(n), j is the imaginary unit, and e^{-j 2\pi / N} is a primitive Nth root of unity called the twiddle factor (W). Evaluating this definition directly requires N^2 operations, as there are N outputs X_k and each output requires a sum of N terms. Thus the mathematical complexity of the DFT is O(N^2) [24].
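For reference, a direct O(N^2) evaluation of Eq. (6.1) can be written in a few lines of C; the sketch below is included only to make the operation count concrete and is not part of the project code.

/* Direct evaluation of the DFT of Eq. (6.1): N outputs, N terms per output. */
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void dft(const float* re_in, const float* im_in, float* re_out, float* im_out, int N)
{
    for (int k = 0; k < N; k++) {                       /* N outputs            */
        float sum_re = 0.0f, sum_im = 0.0f;
        for (int n = 0; n < N; n++) {                   /* N terms per output   */
            float angle = -2.0f * (float)M_PI * k * n / N;
            float wr = cosf(angle), wi = sinf(angle);   /* twiddle factor       */
            sum_re += re_in[n] * wr - im_in[n] * wi;    /* complex multiply-add */
            sum_im += re_in[n] * wi + im_in[n] * wr;
        }
        re_out[k] = sum_re;
        im_out[k] = sum_im;
    }
}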
9. The FFT was introduced by J.W. Cooley and J.W. Tukey in 1965 [25]. It is a very efficient method to compute the DFT in O(N log N) operations. The FFT reduces the operation count by exploiting the symmetry and periodicity of the twiddle factors and by eliminating trivial operations such as multiplications by 1. The difference in speed can be substantial, especially for long data sets where N may be in the thousands or millions; in such cases the computation time can be reduced by several orders of magnitude, and the improvement is roughly proportional to N/log(N). This huge improvement made many DFT-based algorithms practical.
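As a rough illustration of this ratio (not a measured figure from this project): for N = 2^20, about one million points, direct evaluation requires on the order of N^2 ≈ 10^12 operations, whereas the FFT requires on the order of N log2 N ≈ 2 x 10^7, a reduction of roughly N / log2 N ≈ 50,000 times.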
FFT Algorithms
10. Cooley-Tukey FFT Algorithm. By far the most common FFT is the Cooley-Tukey algorithm. This is a divide and conquer algorithm that, in its radix-2 form, recursively breaks down a DFT of size N into two pieces of size N/2 at each step, and is therefore limited to power-of-two sizes. Commonly used variants of this framework are listed here.
(a) Radix-2
(b) Radix-4
11. Prime-Factor Algorithm (PFA). The PFA, introduced by Good and Thomas [26], uses the Chinese Remainder Theorem to factorize the DFT similarly to Cooley-Tukey, but without the twiddle factors.
13. The Radix-2 FFT algorithm recursively divides the N point input signal into two N/2 point signals at each stage. This recursion continues until the signal is divided into 2 point signals, hence the name Radix-2. The input size N must be an integral power of two to apply this recursion. The FFT is computed in log2 N stages, requiring a total of (N/2) log2 N complex multiplications and N log2 N complex additions [28].
14. The Radix-2 FFT is computed in the following steps:
(a) Decompose an N point time domain signal into N time domain signals each composed of a single point.
(b) Calculate the frequency spectrum of each of the N single-point signals.
(c) Synthesize the N frequency spectra into a single N point frequency spectrum.
16. Bit Reversal Sorting. The interlaced decomposition is done using a bit reversal sorting algorithm. This algorithm rearranges the order of the N time domain samples by counting in binary with the bits flipped left-for-right, as shown in Fig. 6-2. It produces the same output as the interlaced decomposition of Fig. 6-1.
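A direct way to perform this reordering on the host is sketched below; the type name cplx and the function name are illustrative and do not appear in the project code.

/* Reorder an N-point signal (N a power of two) into bit-reversed order by
   reversing the log2(N) address bits of each index, as described for Fig. 6-2. */
typedef struct { float re, im; } cplx;

void bit_reverse_sort(cplx* x, int N, int log2N)
{
    for (int i = 0; i < N; i++) {
        int rev = 0;
        for (int b = 0; b < log2N; b++)                /* flip the bits left-for-right */
            rev |= ((i >> b) & 1) << (log2N - 1 - b);
        if (rev > i) {                                 /* swap each pair exactly once  */
            cplx tmp = x[i];
            x[i] = x[rev];
            x[rev] = tmp;
        }
    }
}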
17. The second step in computing the FFT is to calculate the frequency spectra of N
time domain signals of single point each. According to the Duality Principle the
frequency spectrum of a 1 point time domain signal is equal to itself [28]. A single point
in the frequency domain corresponds to a sinusoid in the time domain. By duality, the
inverse is also true; a single point in the time domain corresponds to a sinusoid in the
frequency domain. Thus no computation is required for this step; each of the 1 point signals can now be regarded as a frequency spectrum rather than a time domain signal.
18. The final step is to combine the N single-point frequency spectra in the exact reverse order of the time domain decomposition. For a 16 point signal, the 16 one-point spectra are first synthesized into 8 frequency spectra (2 points each), then into 4 frequency spectra (4 points each), and so on. The last stage results in the output of the FFT, a 16 point frequency spectrum.
19. Radix-2 FFT Butterfly. The butterfly is the basic computational element of the FFT, transforming two complex input points into two complex output points. Fig. 6-3 shows a Radix-2 FFT butterfly.
20. For N = 2, the FFT calculations are shown in Eq. (6.2) through Eq. (6.4).
X_k = \sum_{n=0}^{1} x(n) \, e^{-j 2\pi k n / 2}    Eq. (6.2)
X(0) = x(0) + x(1)    Eq. (6.3)
X(1) = x(0) - x(1)    Eq. (6.4)
21. Fig. 6-5 shows the flow graph for the FFT calculations, where W represents the twiddle factor. The butterfly shown in Fig. 6-5 requires two complex multiplications and two complex additions.
22. This butterfly pattern is repeated over and over to compute the entire frequency spectrum. The flow graph for an 8 point FFT is shown in Fig. 6-6.
23. The flow graph in Fig. 6-5 corresponds to an operation count of O(N^2). The FFT reduces the operation count by exploiting the symmetry property of the twiddle factor (W) and eliminating the trivial multiplications. The general FFT butterfly is shown in Fig. 6-7.
24. Symmetry Property. Applying the symmetry property defined as Eq. (6.5)
and Eq. (6.6) on the twiddle factors in Fig. 6-7, reduces the complex multiplications by a
factor of two [29]. Fig. 6-8 shows a simplified butterfly with one complex multiplication
and two complex additions.
25. Thus the operation count of the FFT is reduced to (N/2) log2 N complex multiplications and N log2 N complex additions. The overall complexity is O(N log2 N).
CHAPTER 7
OPENCL IMPLEMENTATION
Introduction
1. The previous chapter (Chapter 6) introduced the Radix-2 FFT algorithm and provided its mathematical details. This chapter describes the handling of complex numbers and how the mathematics is implemented to compute the FFT on the GPU using the OpenCL programming language.
Data Packaging
3. The real and imaginary parts of the input data are single precision floating point values. Each complex number can be stored as a two element vector, where each element is a floating point value. OpenCL supports vector data types with floating point elements; float2 is a built-in OpenCL data type that stores two floating point values in a vector and is well suited for handling complex data.
4. The host code copies the real and imaginary input values to a buffer in the global memory of the GPU. A kernel is then launched to package the floating point real and imaginary values into the float2 vector data type: the first element of each float2 vector is the real part and the second element is the corresponding imaginary part. A sketch of such a kernel is shown below.
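The kernel and argument names in the following sketch are illustrative and not necessarily those used in the project code.

// Interleave separate real and imaginary arrays into float2 complex values.
__kernel void pack_complex(__global const float* re,
                           __global const float* im,
                           __global float2* out)
{
    int gid = get_global_id(0);
    out[gid] = (float2)(re[gid], im[gid]);   // .x = real part, .y = imaginary part
}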
1D FFT Implementation
Data Decomposition
8. Elster Algorithm. One of the most efficient methods for producing a vector representation of the bit reversal permutation is Elster's algorithm [31]. The purpose of this algorithm is to create a vector B containing the bit reversal permutation values. Elster's method computes the N-point bit reversal vector in log2(N) steps. For example, for N = 8 the construction is shown in Table 7-1:
Table 7-1: Elster construction of the bit reversal vector for N = 8
Initial value:  0
Add 4:          0 4
Add 2:          0 4 2 6
Add 1:          0 4 2 6 1 5 3 7
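The doubling pattern of Table 7-1 can be captured in a few lines of host code. The following sketch builds the bit reversal vector B in log2(N) steps; it is illustrative rather than the project's exact routine.

/* Elster-style construction of the bit reversal vector B for a power-of-two N:
   each step doubles the filled length and adds a halved offset to the old half. */
void elster_bitrev(int* B, int N)
{
    B[0] = 0;
    for (int len = 1, add = N / 2; len < N; len *= 2, add /= 2)
        for (int i = 0; i < len; i++)
            B[len + i] = B[i] + add;       /* new half = old half + offset */
}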
9. Elster's method can be parallelized by dividing the total length N by the number of parallel processes and calculating each block on a separate processing element. Assuming a block size of M, N/M blocks are formed, and each block can be computed in parallel on a separate processing element. As GPUs have a large number of processing elements, this implementation maps well to the GPU architecture.
10. The initial point of each block, its "head", is pre-calculated by applying Elster's method to the number of blocks. Fig. 7-2 shows an example of Elster's method on 16 elements: the heads are computed by bit reversing the 4 element vector [0 1 2 3], which yields [0 2 1 3]. The four blocks are then computed in parallel on separate processing elements, speeding up the calculation.
Butterfly Computations
11. The FFT butterfly is derived by expanding the DFT formula shown in Eq. (7.1).
X_k = \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi k n / N}    Eq. (7.1)
12. After bit reversal sorting of the input data, one stage of 2 point butterflies can be launched in parallel. Fig. 7-3 shows the simplified FFT butterfly, which involves one complex multiplication, two complex additions and a sign change; a kernel sketch for one such stage is shown below.
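The following OpenCL C sketch shows how one such stage of butterflies could be written; the kernel name, argument names and indexing scheme are illustrative choices, not the project's hard-coded kernels.

// One radix-2 butterfly stage: each work-item handles one butterfly, i.e. one
// complex multiplication by the twiddle factor, one addition and one subtraction.
// 'span' is the distance between the two butterfly inputs for this stage.
__kernel void radix2_stage(__global float2* x, const int N, const int span)
{
    int t     = get_global_id(0);              // one of N/2 butterflies
    int block = t / span;
    int j     = t % span;                      // position inside the block
    int i0    = block * 2 * span + j;          // upper input index
    int i1    = i0 + span;                     // lower input index

    // Twiddle factor W for this butterfly: exponent j*N/(2*span) of W_N.
    float angle = -2.0f * M_PI_F * (float)(j * (N / (2 * span))) / (float)N;
    float2 w = (float2)(native_cos(angle), native_sin(angle));

    float2 a  = x[i0];
    float2 b  = x[i1];
    float2 wb = (float2)(w.x * b.x - w.y * b.y,   // complex multiply W*b
                         w.x * b.y + w.y * b.x);

    x[i0] = a + wb;                            // butterfly outputs
    x[i1] = a - wb;
}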
13. Simple Program Structure. To calculate an N point FFT, the simplest approach is to bit reverse the input data and launch 2 point FFT kernels at each stage to calculate the entire frequency spectrum. This approach requires log2 N kernel launches, with the input stride increasing by a factor of two at each kernel launch.
14. Limitations. The 2 point FFT butterfly requires only about 25 ALU operations, while each thread of the 2 point FFT kernel reads 4 floating point values (16 bytes). The ALU to fetch ratio, a kernel performance parameter defined as the ratio of the time taken by ALU operations to the time spent fetching data, is therefore very low, which degrades kernel performance. For high kernel performance this ratio should be high, and doing more calculations per thread can improve it.
15. The ALU to fetch ratio can be improved by calculating two or more FFT stages in a single kernel launch. This approach reduces the total number of kernel launches and hence the overhead incurred in issuing kernel calls.
16. Hard-Coded FFT Kernels. FFT kernels for 2 points, 4 points and 8 points are hard-coded in the program. The code is developed by expanding the DFT formula for N = 2, 4 and 8. A 16 point FFT kernel is not hard-coded because it consumes more register file space per thread, reducing the total number of concurrent threads and degrading performance.
17. Twiddle Factor Calculation. The twiddle factor (W) for each FFT stage is calculated on the fly. The mathematical equation used in the kernel is obtained by applying the identity in Eq. (7.3) to Eq. (7.2), yielding Eq. (7.4).
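Since Eqs. (7.2) to (7.4) are not reproduced in this text, the following is a sketch of the standard relationship they appear to describe, assuming the identity referred to is Euler's formula; the alignment with the original equation numbers is therefore tentative.

W_N^k = e^{-j 2\pi k / N}    (cf. Eq. 7.2)
e^{-j\theta} = \cos\theta - j \sin\theta    (Euler's identity, cf. Eq. 7.3)
W_N^k = \cos(2\pi k / N) - j \sin(2\pi k / N)    (cf. Eq. 7.4)

The real and imaginary parts in the last line map directly to OpenCL's cos/sin (or native_cos/native_sin) built-in functions inside the kernel.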
19. Computing Large FFT. FFTs of length 16 and above are computed by invoking a sequence of the hard-coded, small-length FFT kernels. For example, a 1024 point FFT requires 10 stages and is calculated by launching the 8 point FFT kernel three times, completing 9 stages, followed by a 2 point FFT kernel for the 10th stage. The largest FFT that can be computed using this implementation is 2^24 (approximately 16 million) points. A host-side sketch of this stage scheduling is given below.
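The enqueue_fft*_kernel helpers in the following sketch are hypothetical wrappers around clSetKernelArg and clEnqueueNDRangeKernel, not functions from this project.

/* Compose a large FFT from the hard-coded 8, 4 and 2 point kernels.
   For N = 1024 (10 stages): three 8-point passes, then one 2-point pass. */
#include <CL/cl.h>

void enqueue_fft8_kernel(cl_command_queue q, cl_mem buf, int n, int span);  /* hypothetical */
void enqueue_fft4_kernel(cl_command_queue q, cl_mem buf, int n, int span);  /* hypothetical */
void enqueue_fft2_kernel(cl_command_queue q, cl_mem buf, int n, int span);  /* hypothetical */

void run_large_fft(cl_command_queue queue, cl_mem buf, int N, int log2N)
{
    int stages_left = log2N;                   /* e.g. 10 for N = 1024              */
    int span = 1;                              /* butterfly span for the next pass  */
    while (stages_left >= 3) {                 /* each 8-point pass covers 3 stages */
        enqueue_fft8_kernel(queue, buf, N, span);
        span <<= 3;
        stages_left -= 3;
    }
    if (stages_left == 2) { enqueue_fft4_kernel(queue, buf, N, span); stages_left -= 2; }
    if (stages_left == 1) { enqueue_fft2_kernel(queue, buf, N, span); }
}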
2D FFT Implementation
21. For a GPU, computing the FFT along the columns directly is very expensive due to the large input strides when accessing data. A modified technique for the 2D FFT therefore introduces a matrix transpose stage after the 1D FFT on rows, so the 2D FFT can be calculated in the following steps:
(a) Compute the 1D FFT of each row.
(b) Transpose the matrix.
(c) Compute the 1D FFT of each row of the transposed matrix (the original columns).
(d) Transpose the result back to the original orientation.
22. The transpose of a matrix A swaps its rows and columns:
A^T_{i,j} = A_{j,i}    Eq. (7.5)
23. Simple Transpose Kernel. In a simple transpose kernel each thread reads one element of the input matrix from global memory and writes the same element back at its transposed index in global memory. Fig. 7-5 shows a 4x4 matrix with the index of each element; a sketch of such a kernel is given below.
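The names in the following sketch are illustrative.

// Naive matrix transpose: each work-item copies one element to its transposed
// position. Either the reads or the writes are necessarily uncoalesced.
__kernel void transpose_naive(__global const float2* in,
                              __global float2* out,
                              const int width, const int height)
{
    int x = get_global_id(0);                        // column index
    int y = get_global_id(1);                        // row index
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];     // write at the transposed index
}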
25. Local memory, or LDS, is an extremely fast memory available on GPUs, ideally providing zero latency data access. Each SIMD engine on the ATI 5870 owns 32 KB of LDS. Data in the LDS can be accessed in any pattern without performance penalty as long as there are no bank conflicts [32]; a bank conflict occurs when two or more threads access different addresses that map to the same LDS bank in the same cycle (reads of the same address are handled as a broadcast without penalty). Using the LDS in the transpose kernel can improve performance by a large factor.
26. The improved transpose kernel reads the input data from global memory in a coalesced fashion and writes it into local memory. Fig. 7-7 shows this coalesced data transfer from global to local memory.
27. Once the data is loaded into local memory, each thread can access the transposed data element and write it out to global memory, again in a coalesced fashion. Fig. 7-8 shows how consecutive threads write the transposed elements to global memory; a sketch of such a tiled kernel is shown below.
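In the following sketch, TILE, the kernel name and the argument names are illustrative choices rather than the project's exact code.

// Tiled transpose using local memory: a work-group loads a TILE x TILE block
// with coalesced reads, synchronizes, then writes the transposed block with
// coalesced writes. The +1 padding is a common trick to avoid LDS bank conflicts.
#define TILE 16

__kernel void transpose_lds(__global const float2* in,
                            __global float2* out,
                            const int width, const int height)
{
    __local float2 tile[TILE][TILE + 1];

    int gx = get_group_id(0) * TILE + get_local_id(0);
    int gy = get_group_id(1) * TILE + get_local_id(1);
    if (gx < width && gy < height)
        tile[get_local_id(1)][get_local_id(0)] = in[gy * width + gx];    // coalesced read

    barrier(CLK_LOCAL_MEM_FENCE);

    int tx = get_group_id(1) * TILE + get_local_id(0);   // swap the block coordinates
    int ty = get_group_id(0) * TILE + get_local_id(1);
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[get_local_id(0)][get_local_id(1)];  // coalesced write
}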
28. The use of local memory in the matrix transpose kernel provides a speedup of over 20 times compared to the un-optimized kernel.
CHAPTER 8
RESULTS AND CONCLUSION
1. OpenCL code for the 1D and 2D FFT was developed using the methodology discussed in the previous chapter. This chapter summarizes the results of the 1D and 2D FFT implementations and the benchmarks against high-end Intel CPUs.
Computing Environment
Experiment Setup
5. The GPU FFT time is measured using the ATI Stream Profiler, a tool provided by AMD for accurate performance measurement. The ATI Stream Profiler can measure time with 1 nanosecond precision [33].
1D FFT Results
[Table 8-1: 1D FFT results. Columns: Input Size (x1000 points), FFT Time (ms), Total Time (ms), FFT Time (ms), FFT Time (ms). Data rows not available.]
2D FFT Results
[Table 8-2: 2D FFT results. Columns: Input Size (points), FFT Time (ms), Total Time (ms), FFT Time (ms), FFT Time (ms). Data rows not available.]
Analysis of Results
7. The results in Table 8-1 and Table 8-2 show that the OpenCL FFT implemented on the ATI 5870 GPU outperforms state-of-the-art Intel CPUs. The performance advantage grows with the input size, because larger inputs expose more parallelism. The data transfer time between the host and the GPU is a factor that limits the performance gains.
Conclusion
10. GPU is novel parallel computing architecture providing huge speedups to data-
parallel computations at a relatively lower monetary cost. The performance gains
depend upon the exploitable parallelism in the algorithm being implemented.
12. The OpenCL implementation of the FFT on the ATI 5870 GPU outperforms the latest Intel CPUs by exploiting the data-level parallelism inherent in the FFT algorithm. Both 1D and 2D FFT algorithms have been implemented successfully.
13. Currently, the data transfer to the GPU over the PCIe interface presents a performance bottleneck. AMD's Fusion APUs remove this bottleneck, opening new horizons for parallel computing on GPUs.
REFERENCES