
IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A

GRAPHICS PROCESSING UNIT

by

NUST CDT AAMIR MAJEED (060902)

COLLEGE OF AERONAUTICAL ENGINEERING

PAF ACADEMY RISALPUR

September, 2010
IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A
GRAPHICS PROCESSING UNIT
By

NUST CDT AAMIR MAJEED (060902)

ADVISOR:

SQN LDR DR. TAUSEEF UR REHMAN

CO-ADVISOR:

WG CDR DR. SOHAIL AHMED

REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BE

COLLEGE OF AERONAUTICAL ENGINEERING

PAF ACADEMY RISALPUR

September, 2010
RESTRICTED

COLLEGE OF AERONAUTICAL ENGINEERING

PAF ACADEMY, RISALPUR

IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A

GRAPHICS PROCESSING UNIT

by

NUST Cdt Aamir Majeed, 69th EC

A report submitted to the College of Aeronautical Engineering in partial fulfillment of the requirements for the degree of B.E (AVIONICS)

APPROVED

(TAUSEEF UR REHMAN)
Squadron Leader
Project Advisor
College of Aeronautical Engineering

(JAHANGIR KIYANI)
Group Captain
Head of Avionics Dept
College of Aeronautical Engineering


ABSTRACT

The Fourier Transform is a widely used tool in many scientific and engineering fields such as Digital Signal Processing. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently computing the Discrete Fourier Transform (DFT). The FFT is computationally intensive, and its real-time implementation on a general purpose CPU is challenging due to limited processing power.
Graphics Processing Units (GPUs) are an emerging breed of massively parallel processors having hundreds of processing cores, in contrast to CPUs. Greater computational power and the parallel architecture of GPUs help them outperform CPUs for data-parallel compute applications by a huge factor. The growing computational power of GPUs has introduced the concept of General Purpose Computing on GPU (GPGPU).
Open Computing Language (OpenCL) is a developing, royalty-free standard for cross-platform general-purpose parallel programming. OpenCL provides a uniform programming environment for developing efficient and portable software for multi-core CPUs and GPUs.
The aim of this project is to implement the Radix-2 FFT algorithm on a state-of-the-art AMD GPU, the ATI Radeon 5870, using the OpenCL programming language. 1D and 2D FFT algorithms were successfully implemented with significant performance gains.

TABLE OF CONTENTS


ABSTRACT.......................................................................................................................ii
TABLE OF CONTENTS...................................................................................................iii
LIST OF FIGURES...........................................................................................................vi
DEDICATION...................................................................................................................vii

CHAPTER 1.......................................................................................................................1
PROJECT INTRODUCTION.............................................................................................1
Project Title.....................................................................................................................1
Project Introduction........................................................................................................1

CHAPTER 2.......................................................................................................................3
LITERATURE REVIEW.....................................................................................................3
FFT on Graphics Hardware............................................................................................3
CUDA FFT Library (CUFFT)..........................................................................................4
Apple FFT Library...........................................................................................................5
IPT ATI Project...............................................................................................................5
Project Motivation...........................................................................................................5

CHAPTER 3.......................................................................................................................6
GRAPHICS PROCESSING UNIT (GPU)..........................................................................6
Introduction.....................................................................................................................6
Major GPU Vendors.......................................................................................................7
Evolution of GPUs..........................................................................................................8
GPU Capabilities............................................................................................................9
General Purpose GPU (GPGPU)...................................................................................9
CPU Vs GPU................................................................................................................10
GPU Application Areas.................................................................................................13

CHAPTER 4.....................................................................................................................14
GPU ARCHITECTURE....................................................................................................14


Flynn's Taxonomy........................................................................................................14
Single Instruction Multiple Data (SIMD) Architecture...................................................14
Generalized GPU Architecture.....................................................................................15
ATI Radeon 5870 Architecture.....................................................................................15
Memory Hierarchy of ATI 5870....................................................................................18

CHAPTER 5.....................................................................................................................20
GPGPU PROGRAMMING ENVIRONMENT...................................................................20
Introduction...................................................................................................................20
Compute Unified Device Architecture (CUDA)............................................................21
Open Computing Language (OpenCL)........................................................................21
Anatomy of OpenCL.....................................................................................................22
OpenCL Architecture....................................................................................................23
OpenCL Execution Model............................................................................................24
OpenCL Memory Model...............................................................................................25
OpenCL Program Structure..........................................................................................27

CHAPTER 6.....................................................................................................................28
FAST FOURIER TRANSFORM (FFT)............................................................................28
Introduction...................................................................................................................28
Fourier Transform.........................................................................................................28
Categories of Fourier Transform..................................................................................29
Discrete Fourier Transform (DFT)................................................................................29
Fast Fourier Transform (FFT).......................................................................................30
FFT Algorithms.............................................................................................................31
Radix-2 FFT Algorithm.................................................................................................31
Decomposition of Time Domain Signal........................................................................32
Calculating Frequency Spectra....................................................................................33
Frequency Spectrum Synthesis...................................................................................33
Reducing Operations Count.........................................................................................35


CHAPTER 7.....................................................................................................................37
OPENCL IMPLEMENTATION........................................................................................37
Introduction...................................................................................................................37
Data Packaging............................................................................................................37
1D FFT Implementation................................................................................................37
Data Decomposition.....................................................................................................38
Parallel Implementation of Elster Algorithm.................................................................39
Butterfly Computations.................................................................................................39
Improved Program Structure........................................................................................41
2D FFT Implementation................................................................................................42
Matrix Transpose Implementation................................................................................42
Matrix Transpose Using Local Memory........................................................................44

CHAPTER 8.....................................................................................................................46
RESULTS AND CONCLUSION......................................................................................46
Computing Environment...............................................................................................46
Experiment Setup.........................................................................................................46
1D FFT Results............................................................................................................46
2D FFT Results............................................................................................................47
Analysis of Results.......................................................................................................48
Conclusion....................................................................................................................49

REFERENCES................................................................................................................50

LIST OF FIGURES

Figure 3-1: ATI Radeon HD 5870 Graphics Card.............................................................6



Figure 3-2: Layout of ATI 5870 Graphics Card.................................................................7


Figure 3-3: GPU Market Share..........................................................................................8
Figure 3-4: CPU vs GPU Peak Performance..................................................................11
Figure 3-5: CPU vs GPU Architecture Comparison........................................................12
Figure 4-1: SIMD Architecture.........................................................................................14
Figure 4-2: Simplified GPU Architecture.........................................................................15
Figure 4-3: Cypress Architecture.....................................................................................16
Figure 4-4: Cross Section of a Compute Unit..................................................................17
Figure 4-5: Stream Core..................................................................................................17
Figure 4-6: ATI 5870 Memory Hierarchy.........................................................................19
Figure 5-1: Compute Device............................................................................................24
Figure 5-2: OpenCL Memory Model................................................................................26
Figure 6-1: Interlaced Decomposition.............................................................................32
Figure 6-2: Bit Reversal Sorting......................................................................................33
Figure 6-3: Radix-2 FFT Butterfly....................................................................................34
Figure 6-4: FFT Flow Graph (N=8)..................................................................................35
Figure 6-5: Generalized FFT Butterfly.............................................................................35
Figure 6-6: Simplified FFT Butterfly.................................................................................36
Figure 7-1: Bit Reversed Vector......................................................................................38
Figure 7-2: Applying Elster on 16 Elements...................................................................40
Figure 7-3: Simplified FFT Butterfly.................................................................................41
Figure 7-4: Matrix Transpose..........................................................................................43
Figure 7-5: 4x4 Matrix Transpose Example....................................................................43
Figure 7-6: Result of Simple Matrix Transpose Kernel...................................................44
Figure 7-7: Coalesced Data Transfer from Global to Local Memory..............................45
Figure 7-8: Writing Transposed Elements to Global Memory.........................................45

DEDICATION


I dedicate this report to my parents who have been a source of inspiration for me; to my
sisters who made me believe in myself when I thought that I could not make it; and
finally to my friends just for being there.


ACKNOWLEDGEMENT

I am grateful to my advisor Sqn Ldr Tauseef ur Rehman for all the guidance and support. I would also like to thank the Department of Avionics Engineering for providing a conducive environment for research and study.


CHAPTER 1

PROJECT INTRODUCTION

Project Title

1. The aim of this project is to implement 1D and 2D Radix-2 Fast Fourier Transform (FFT) algorithms on the ATI Radeon 5870 Graphics Processing Unit (GPU) using the OpenCL programming language, and to benchmark the performance against state-of-the-art CPUs and NVIDIA GPUs.

Project Introduction

2. The Fourier Transform is a well known and widely used tool in many scientific and engineering fields. The Fourier transform converts a signal from the time domain to the frequency domain. It is essential for many image processing techniques including filtering, convolution, manipulation, correlation, and compression.

3. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently computing the Discrete Fourier Transform (DFT). The FFT is computationally intensive, and its real-time implementation on a general purpose CPU is challenging due to limited processing power.
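For reference, the DFT that these algorithms compute is defined, for an N-point sequence x[n], as:

```latex
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1
```

Direct evaluation of this sum takes on the order of N^2 complex operations; the FFT reduces this to the order of N log N.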

4. Graphics Processing Units (GPUs) are an emerging breed of processors originating from the world of graphics, and they form an integral part of a commodity graphics card in a Personal Computer (PC). Historically, GPUs have been used primarily for graphics processing and gaming purposes. However, the growing computational power of GPUs has introduced the concept of General Purpose Computing on GPU (GPGPU) [1]. In contrast to CPUs, GPUs have hundreds of processing cores working in parallel and can handle real-time parallel computations. Greater computational power and the parallel architecture of GPUs help them outperform CPUs for data-parallel compute applications by a huge factor.


5. NVIDIA and AMD are the leading manufacturers of GPUs. NVIDIA has introduced a GPU development platform, Compute Unified Device Architecture (CUDA), for general purpose computing on GPUs. CUDA is not a cross-platform tool and its use is limited to NVIDIA GPUs [2]. NVIDIA has developed a GPU-accelerated FFT library called CUFFT; AMD GPUs cannot take advantage of this library.

6. Open Computing Language (OpenCL) is a developing, royalty-free standard for cross-platform general-purpose parallel programming. OpenCL provides a uniform programming environment for developing efficient and portable software for multi-core CPUs and GPUs [3]. Both NVIDIA and AMD now support OpenCL for their respective GPUs.

7. The aim of this project is to implement the Radix-2 FFT algorithm on an AMD GPU using OpenCL as the programming language. A state-of-the-art AMD GPU, the ATI Radeon 5870, will be used for the implementation. The ATI Radeon 5870 can potentially deliver a peak computational power of 2.72 Tera floating point operations per second (FLOPS).

8. The focus of the project is to exploit the data parallelism inherent in the FFT algorithm and develop code that exploits AMD's GPU architecture by fully utilizing its parallel computing resources. Performance benchmarking of this implementation against state-of-the-art Intel CPUs will be done.


CHAPTER 2

LITERATURE REVIEW

FFT on Graphics Hardware

1. The mathematical complexity of the FFT suggests that it is a computationally expensive task for uni-processor machines, especially when the input size (N) is in the millions. The data-level parallelism of the FFT algorithm can be exploited through a parallel processing architecture to gain speedup.

2. With the introduction of the GPGPU concept in 2002, GPUs became an attractive target for computation because of their high performance and low cost compared to parallel vector machines and CPUs. At that time, general purpose algorithms for the GPU had to be mapped to the programming model provided by graphics APIs, like OpenGL and DirectX. The graphics APIs were unable to fully utilize the compute resources, as access to low-level hardware features was not supported. The use of graphics APIs for GPGPU was challenging, and the performance gains were small compared to the programming effort.

3. K. Moreland and E. Angel [4] were among the first to implement the FFT on graphics hardware in 2003 using graphics APIs. The implementation was done on an NVIDIA GeForce FX 5800 Ultra graphics card, which features a fully programmable pipeline with full 32-bit floating point math enabled throughout. The programming environment used was OpenGL with the Cg shading language and runtime libraries. The average performance achieved was 2.5 Giga FLOPS. J. Spitzer also implemented the FFT on the NVIDIA GeForce FX 5800 Ultra in 2003 [5] using the graphics APIs; the reported peak performance was 5 Giga FLOPS.

4. These FFT implementations on GPUs revealed that using graphics APIs for GPGPU is inefficient: the achievable peak performance was very limited compared to the programming effort. Developers came up with a solution in the form of non-graphics APIs to fully utilize the compute resources of the GPU and reduce the programming effort.


5. NVIDIA launched CUDA in 2007 [6], allowing developers to fully utilize the immense GPU power by accessing all hardware features and resources via the industry-standard high-level language C. With CUDA, the programming effort was reduced and the performance gains were much more significant. In February 2007, NVIDIA launched the first GPU-accelerated FFT library, the CUDA FFT Library (CUFFT) [7].

CUDA FFT Library (CUFFT)

6. CUFFT is the first GPU-accelerated FFT library. The initial version of CUFFT was released in February 2007; the latest release is CUFFT version 3.0, launched in February 2010 [8]. The salient features of this library are listed below.

(a) 1D, 2D and 3D transforms of complex and real-valued data

(b) Batched execution of multiple transforms of any dimension in parallel

(c) 1D transforms size up to 8 Million elements

(d) 2D and 3D transform sizes in the range [2, 16384] in any dimension

(e) In-place and out-of-place transforms for real and complex data

(f) Double-precision transforms on compatible hardware

7. Implemented Algorithms. The CUFFT library implements several FFT algorithms, each with different performance and accuracy. The Radix-2 algorithm is implemented for input sizes that are integral powers of 2; this corresponds to the best performance paths in CUFFT. For transform sizes that are not integral powers of 2, CUFFT uses a more general Mixed-Radix FFT algorithm that is usually slower and less numerically accurate [8].

8. CUFFT Limitation. The use of CUFFT is limited to NVIDIA GPUs because CUDA is not a heterogeneous programming environment; AMD GPUs cannot take advantage of this library. Thus, there is still a need for an accelerated FFT library targeting GPUs from all vendors as well as multi-core CPUs.


Apple FFT Library

9. In February 2010, Apple Inc. published an FFT library for the Mac OS X implementation of OpenCL [9]. This FFT library includes all the features of the CUFFT library. The runtime requirements are Mac OS X v10.6 or later with OpenCL 1.0 support and Apple's Xcode compiler, which limits the use of this library to Apple computers only.

10. The OpenCL developer community modified the library to make it compatible with AMD's OpenCL implementation. The largest transform size achieved in this way is 1024 points, with reported issues of numerical accuracy.

IPT ATI Project

11. In February 2010, Jingfei Kong published an OpenCL implementation of the FFT for AMD ATI GPUs. This implementation accelerates the MATLAB FFT using the MATLAB external interface (MEX) [10]. It supports only 1D transforms in single precision.

Project Motivation

12. This project is a step towards the development of an OpenCL FFT library for AMD ATI GPUs that is comparable to the CUFFT library in features and performance. The contribution of this project is the implementation of 1D transforms, batched 1D transforms and 2D transforms.


CHAPTER 3

GRAPHICS PROCESSING UNIT (GPU)

Introduction

1. Graphics Processing Unit (GPU). A GPU is a specialized processor designed for processing and displaying computer graphics. The terms graphics processing unit and GPU were coined by NVIDIA, the largest GPU manufacturer, in 1999. GPUs are highly efficient at performing the calculations necessary to generate visual output from program data. They are widely used as co-processors in mobile phones, personal computers, laptops and game consoles to offload graphics processing from the central processing unit (CPU) and to meet the ever increasing demand for better graphics. GPUs commonly accompany standard CPUs in Personal Computers (PCs) to accelerate graphics generation and video display. In a PC, a GPU can be present on the motherboard or on a dedicated graphics card.

2. Graphics Card. A graphics card is a peripheral device that interfaces with the motherboard by means of an expansion slot such as a Peripheral Component Interconnect Express (PCIe) slot or an Accelerated Graphics Port (AGP). Fig. 3-1 shows the ATI 5870 graphics card.

Figure 3-1: ATI Radeon HD 5870 Graphics Card


3. The key components of a graphics card are the GPU, video memory, output interface and motherboard interface. Fig. 3-2 depicts the layout of the ATI 5870 graphics card [11].

Figure 3-2: Layout of ATI 5870 Graphics Card

4. Graphics cards offer added functions, such as video capture and decoding, TV
output, or the ability to connect multiple monitors. High performance graphics cards are
used for more graphically demanding purposes, such as PC games.

Major GPU Vendors

5. In 2008, Intel, NVIDIA and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% respectively. However, these numbers count Intel's very low-cost, less powerful integrated graphics solutions as GPUs [12].

6. In June 2010, a 30-day GPU market share survey was carried out by PassMark [13], covering dedicated graphics cards only. According to this survey, NVIDIA and AMD/ATI are the leading GPU manufacturers with 49% and 34% market share respectively, while Intel holds 10%. Fig. 3-3 shows the results of this survey.

Figure 3-3: GPU Market Share

Evolution of GPUs

7. The history of graphics processors traces back to the 1980s, when 2D graphics and text displays were generated by graphics chips called accelerators.

8. The IBM Professional Graphics Controller, released in 1984, was one of the very first 2D graphics accelerators available for the IBM PC [12]. Its high price, slow processor and lack of compatibility with commercial programs kept it from succeeding in the mass market.

9. In 1991, S3 Graphics introduced the first single-chip 2D accelerator, the S3 86C911 [12]. By 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. In the mid-1990s, CPU-assisted real-time 3D graphics were becoming increasingly common in computer and console games, which led to an increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-marketed 3D graphics hardware can be found in fifth-generation video game consoles such as the PlayStation and Nintendo 64.

10. In 1997, 3D accelerators added another significant hardware stage to the 3D graphics pipeline: hardware transform and lighting. The NVIDIA GeForce 256 (NV10) was the first card on the market with this capability, in 1999 [12].

11. NVIDIA was the first to produce a chip with a programmable graphics pipeline, the GeForce 3 (NV20), in 2001. In October 2002, the ATI Radeon 9700 (R300), the world's first Direct3D 9.0 accelerator, added floating point math capability to GPUs [12]. Since then, GPUs have quickly become as flexible as CPUs, and orders of magnitude faster for image-array operations.

GPU Capabilities

12. Historically, GPUs have been used primarily for graphics processing and gaming purposes. As graphics processing is inherently a parallel task, GPUs naturally have a more parallel architecture than standard CPUs. Furthermore, 3D video games demand very high computational power, driving GPU development beyond CPUs. Modern GPUs therefore offer, in comparison to CPUs, extremely high performance for the monetary cost.

13. Modern GPUs have a more flexible and programmable graphics pipeline and offer high peak performance. Naturally, interest has developed in whether this GPU processing power can be harnessed for more general purpose calculations.

General Purpose GPU (GPGPU)

14. The addition of programmable stages and higher precision arithmetic to the graphics pipeline allows software developers to use GPUs for processing non-graphics data. The idea of utilizing the parallel computing resources of a GPU for non-graphics general purpose computations is known as General-Purpose computation on Graphics Processing Units (GPGPU). The term GPGPU was coined by Mark Harris in 2002 when he recognized an early trend of using GPUs for non-graphics applications [1].


15. GPGPU is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU. Once designed solely for computer graphics and difficult to program, today's GPUs are general-purpose parallel processors with accessible programming interfaces and industry-standard languages such as C. Porting applications to GPUs often achieves speedups of orders of magnitude over optimized CPU implementations.

CPU Vs GPU

16. Performance Comparison. The motivation behind GPGPU is manifold. A high end graphics card costs less than a high end CPU, yet provides peak performance more than a hundred times that of a contemporary CPU. As depicted in Fig. 3-4 [14], modern GPUs completely outperform state-of-the-art CPUs in theoretical peak performance, the maximum number of floating point operations per second (FLOPS).

17. CPU performance tops out at about 25 Giga FLOPS per core; thus a Core i7 with 4 cores delivers a peak performance of roughly 110 Giga FLOPS [15], whereas the ATI Radeon 5870 delivers 2.72 Tera FLOPS and the NVIDIA GTX 285 delivers 1 Tera FLOPS. This huge performance advantage of GPUs justifies GPGPU.


Figure 3-4: CPU vs GPU Peak Performance

18. Architecture Comparison. The architectures of a quad-core CPU and a typical GPU are compared in Fig. 3-5 [6]. The processing element in a microprocessor where the floating point math occurs is the Arithmetic Logic Unit (ALU); ALUs are shown as green blocks in Fig. 3-5. On a quad-core CPU, the 4 ALUs can handle 4 floating point operations simultaneously, whereas the GPU shown here has 128 ALUs and can handle 128 floating point operations simultaneously.


Figure 3-5: CPU vs GPU Architecture Comparison

19. The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, a GPU typically contains hundreds of basic cores. Each CPU core is capable of running a heavy task independently, so multiple tasks can map to different cores. A GPU core is a very basic processing element, and all cores can perform the same operation on different data simultaneously.

20. CPUs use a large chip area for control circuitry and for data caching to speed up data accesses. GPUs, on the other hand, devote very little area to caches and control circuitry; most of the die area is occupied by ALUs. Thus GPUs gain their performance advantage by allocating a huge number of transistors to floating point calculations.

21. GPUs also have a larger memory bus width than CPUs, which results in faster memory access. The dynamic RAM used in modern GPUs is GDDR5, with much greater bandwidth than the DDR2/DDR3 DRAM commonly used in consumer PCs. CPUs typically operate at a 2-3 GHz clock frequency; the GPU clock frequency is lower, typically up to 1.2 GHz, but this gap has been closing over the last few years.


GPU Application Areas

22. GPUs offer very high peak performance, but not all applications can take full
advantage of it. As GPUs are massively parallel devices, an application needs to be
highly parallel to fully utilize the GPU resources. Applications such as graphics
processing are highly parallel in nature and can keep the cores busy, resulting in a
significant performance improvement over a standard CPU.

23. For applications less amenable to such high levels of parallelization, the extent
to which the available performance can be harnessed depends on the nature of the
application and the investment put into software development.

24. Following are the major application areas where GPUs provide significant
speedups over standard CPUs.

(a) Fast Fourier Transform (FFT)

(b) Image and Video Processing

(c) Multi Dimensional Signal Processing

(d) Particle Interaction and Fluid Dynamics Simulations

(e) Radar Signal Processing

(f) Linear Algebra (BLAS, LAPACK)

(g) Partial Differential Equations

(h) MATLAB Acceleration


CHAPTER 4

GPU ARCHITECTURE

Flynn's Taxonomy

1. Flynn's Taxonomy is a way to characterize computer architectures. It
categorizes all computers according to the number of instruction streams and data
streams they have, where a stream is a sequence of instructions or data on which a
computer operates. There are four classes of computers as defined by Flynn's
Taxonomy [16].

(a) Single Instruction Single Data (SISD)

(b) Single Instruction Multiple Data (SIMD)

(c) Multiple Instruction Single Data (MISD)

(d) Multiple Instruction Multiple Data (MIMD)

Single Instruction Multiple Data (SIMD) Architecture

2. In a SIMD system, a single instruction stream is concurrently broadcast to
multiple processors, each with its own data stream. Each processor thus executes the
same instruction or program on a different data set concurrently. GPU architecture is
based on the SIMD model. Fig. 4-1 [17] shows a typical SIMD architecture.

Figure 4-1: SIMD Architecture


Generalized GPU Architecture

3. Fig. 4-2 [18] shows a simplified diagram of a generalized GPU device. A GPU
device comprises a set of compute units. Each compute unit has a set of stream
cores, which are further divided into basic execution units called processing elements.
All ATI GPUs follow a similar design pattern; however, the number of compute units or
stream cores may vary from device to device.

Figure 4-2: Simplified GPU Architecture

ATI Radeon 5870 Architecture

4. The ATI Radeon 5870 architecture is code-named Cypress [19]. Fig. 4-3
illustrates the Cypress architecture. The GPU comprises 20 compute units which
operate as SIMD engines. Each SIMD engine consists of 16 stream cores. A stream
core houses 5 processing elements.

5. SIMD Engine. Each compute unit consists of 16 stream cores and operates
as a SIMD engine. All stream cores within a SIMD engine have to execute the same
instruction sequence; different compute units may execute different instructions. Fig. 4-4
[19] shows the internal design of a compute unit.

Figure 4-3: Cypress Architecture


Figure 4-4: Cross Section of a Compute Unit

6. Stream Cores. A stream core consists of multiple processing elements for
floating point calculations. The branch unit shown in Fig. 4-5 handles branching
statements and thus allows threads to take different computation paths without incurring
overheads. The register file is extremely fast memory, private to each thread, used for
fast data access and for holding intermediate variables.

Figure 4-5: Stream Core


7. Processing Elements. Processing elements are the fundamental
programmable computational units that perform integer, single-precision floating point,
double-precision floating point and transcendental operations. A stream core is
arranged as a five-way very long instruction word (VLIW) processor, see Fig. 4-5. Up
to five scalar operations can be co-issued in a VLIW instruction, each of which is
executed on one of the corresponding five processing elements. Processing elements
can execute single-precision floating point or integer operations. One of the five
processing elements can also perform transcendental operations (sine, cosine,
logarithm, etc.). Double-precision floating point operations are processed by connecting
two or four of the processing elements (excluding the transcendental core) to perform a
single double-precision operation [18].

Memory Hierarchy of ATI 5870

8. GPUs generally feature a multi-level memory space for efficient data access and
communication within the GPU. Fig. 4-6 [19] illustrates the memory spaces on ATI
5870.

9. Global Memory. Global memory is the main memory pool on the GPU and is
accessible by all processing elements on the GPU for read and write operations. The ATI
5870 features 1 GB of GDDR5 global memory operating at 1.2 GHz. The data transfer rate
is 4.8 Gbps per pin via a 256-bit wide memory bus, and the memory bandwidth is
153.6 GB/s. This is the slowest memory on the GPU.

10. Constant Data Cache. ATI 5870 GPU features 48 KB of constant data cache
memory used to store the frequently used constant values. This memory is written by
the host and all processing elements have only read access to this memory. Constant
cache is a very fast memory with 4.25 TB/s memory bandwidth.

11. Local Data Share (LDS). Each SIMD engine has a 32 KB local memory called
the LDS. All processing elements within a SIMD engine can share data using this
memory. The LDS offers 2.125 TB/s memory bandwidth, providing low latency data
access to each SIMD engine. The LDS is arranged into 32 banks of 1 KB each, and
provides zero latency reads in broadcast mode and in conflict-free reads/writes.

Figure 4-6: ATI 5870 Memory Hierarchy

12. Registers. Registers are the fastest memory available on the GPU. Each SIMD
engine possesses a 256 KB register file. Registers provide 13 TB/s memory bandwidth
and are local to each processing element.

13. Global Data Share. The ATI 5870 also features a low latency global data share
(GDS) allowing all processing elements to share data. This memory space is not available
on NVIDIA GPUs or older ATI GPUs. The size of the GDS is 64 KB and the memory access
latency is only 25 clock cycles.


CHAPTER 5

GPGPU PROGRAMMING ENVIRONMENT

Introduction

1. Early GPUs were designed specifically to implement graphics programming
standards such as OpenGL and Microsoft DirectX. The tight coupling between the
language used by graphics programmers and the graphics hardware ensured good
performance for most applications. However, this relationship limited graphics-processing
realism to only that which was defined in the graphics language. To
overcome this limitation, GPU designers eventually made the pixel processing elements
customizable using specialized programs called graphics shaders [20].

2. Over time, developers and GPU vendors evolved shaders from simple assembly
language programs into high-level programs that create the amazingly rich scenes
found in today’s 3D software. To handle increasing shader complexity, the pixel
processing elements were redesigned to support more generalized math, logic, and flow
control operations. This set the stage for a new way to accelerate computation, the
GPGPU.

3. GPU vendors and software developers realized that the trends in GPU designs
offered an incredible opportunity to take the GPU beyond graphics. All that was needed
was a non-graphics Application Program Interface (API) that could engage the emerging
programmable aspects of the GPU and access its immense power for non-graphics
applications.

4. The first non-graphics API, named Compute Unified Device Architecture (CUDA),
was introduced by NVIDIA in August 2007. NVIDIA actually devoted silicon area to
facilitate the ease of parallel programming, so this does not represent software changes
alone; additional hardware was added to the chip [6].

5. Open Computing Language (OpenCL) is another emerging standard for GPGPU
programming. OpenCL was proposed by Apple and has broad industry support from
AMD, Intel, ARM, Texas Instruments and many others. The OpenCL specifications are
managed by the KHRONOS Group [3].

6. Microsoft DirectCompute is another API that supports GPGPU on Microsoft
Windows Vista and Windows 7. The DirectCompute architecture shares a range of
computational interfaces with its competitors: the KHRONOS Group's OpenCL and
NVIDIA's CUDA. The following sections of the report provide further details of these APIs.

Compute Unified Device Architecture (CUDA)

7. Introduction. CUDA is a parallel computing architecture developed by
NVIDIA. The programming language used to access the GPU resources is a subset of
the widely used computer language C, with extensions to support parallel processing.
'C for CUDA' (C with NVIDIA extensions) is compiled through the PathScale Open64 C
compiler or the NVIDIA CUDA Compiler (NVCC) to generate machine code for
execution on the GPU. CUDA works with all NVIDIA GPUs from the G8X series
onwards, including the GeForce, Quadro and Tesla lines. Programs developed for
the GeForce 8 series also work without modification on all NVIDIA video cards, due to
binary compatibility. CUDA gives developers access to the native instruction set and
memory of the parallel computational elements in CUDA-enabled GPUs. Using CUDA,
the latest NVIDIA GPUs effectively become open architectures like CPUs [6].

8. Limitations. The major limitation of CUDA is that it targets only NVIDIA
GPUs (homogeneous) [2]. OpenCL, on the other hand, is extremely heterogeneous and
targets not only GPUs from all vendors but also x86 CPUs, DSPs, Cell engines and
handheld devices.

Open Computing Language (OpenCL)

9. Introduction. OpenCL (Open Computing Language) is the first open,
royalty-free standard for general-purpose parallel programming of heterogeneous
systems. OpenCL provides a uniform programming environment for software
developers to write efficient, portable code for high-performance compute servers,
desktop computer systems and handheld devices using a diverse mix of multi-core
CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs [3].
OpenCL is intended to form the foundation layer of a parallel computing ecosystem of
platform-independent tools, middleware and applications.

10. OpenCL is being created by the KHRONOS Group with the participation of many
industry-leading companies and institutions including AMD, Apple, ARM, Electronic
Arts, Ericsson, IBM, Intel, Nokia, NVIDIA and Texas Instruments.

11. The OpenCL language is based on C99 for writing programs that execute on
OpenCL devices, together with APIs that are used to define and then control the
platforms. OpenCL provides parallel computing using task-based and data-based
parallelism. Its architecture shares a range of computational interfaces with two
competitors, NVIDIA's CUDA and Microsoft's DirectCompute.

12. OpenCL Advantages. OpenCL advantages over CUDA are summarized
below:

(a) Heterogeneous computing

(b) Code portability

(c) Support for task-parallelism

(d) Cross platform compatibility

Anatomy of OpenCL

13. The KHRONOS Group released the first OpenCL specification in 2008 [21]. The
OpenCL 1.0 specification is made up of three main parts:

(a) Language specification

(b) Platform layer API

(c) Runtime API


14. Language Specification. The language specification describes the
syntax and programming interface for writing compute programs that run on supported
devices, such as AMD GPUs and multi-core CPUs. The language used is based on a
subset of ISO C99. C was chosen as the basis for OpenCL due to its prevalence and
familiarity in the developer community. To foster consistent results across different
platforms, well-defined IEEE 754 numerical accuracy is specified for all floating point
operations, along with a rich set of built-in functions.

15. Platform Layer API. The platform layer API gives the developer access to
routines that query for the number and types of devices in the system. The developer
can then select and initialize the necessary compute devices to properly run their
workload. The compute resources required for job submission and data transfer are
created at this layer.

16. Runtime API. The runtime API allows the developer to queue up work
for execution and is responsible for managing the compute and memory resources in
the OpenCL system.

OpenCL Architecture

17. OpenCL Platform Model. The OpenCL platform consists of a host and
one or more compute devices. The host is the CPU running the main operating system.
A compute device is any CPU or GPU that provides processing power for OpenCL.
OpenCL allows multiple heterogeneous compute devices to connect to a single host
and efficiently divides the work among them.

18. Compute Device. A compute device consists of a collection of one or
more compute units. A compute unit consists of one or more processing elements. Each
processing element executes the same code in single instruction multiple data (SIMD)
fashion [18]. Fig. 5-1 shows the general layout of a compute device.


Figure 5-1: Compute Device

OpenCL Execution Model

19. Since OpenCL is meant to target not only GPUs but also other accelerators, such
as multi-core CPUs, flexibility is given in specifying the type of task: whether it is
data-parallel or task-parallel. The OpenCL execution model includes compute kernels and
compute programs.

20. Compute Kernel. A compute kernel is the basic unit of executable code
and can be thought of as similar to a C function. Execution of such kernels can proceed
either in-order or out-of-order depending on the parameters passed to the system when
queuing up the kernel for execution. Events are provided so that the developer can
check on the status of outstanding kernel execution requests and other runtime
requests.

21. Compute Program. A compute program is a collection of compute
kernels and functions, similar to a dynamic library. Both compute kernels and programs
are executed on the compute device specified in the host code.

22. Computation Domain. In terms of organization, the execution domain of a
kernel is defined by an N-dimensional computation domain. This lets the system know
the problem size to which the user would like to apply a kernel. Each element in the
execution domain is a work-item, and OpenCL provides the ability to group work-items
into work-groups for synchronization and communication purposes.

OpenCL Memory Model

23. OpenCL defines a multi-level memory model, with memory ranging from private
memory visible only to an individual processing element to global memory that is visible
to all compute units on the device. Depending on the actual memory subsystem,
different memory spaces are allowed to be collapsed together.

24. OpenCL 1.0 defines 4 memory spaces: private, local, constant and global [22].
Fig. 5-2 shows a diagram of the memory hierarchy defined by OpenCL.

25. Private Memory. Private memory can only be used by a single
processing element; no two processing elements can access each other's private
memory. This is similar to registers in a single CPU core. Private memory is the fastest
memory available on the GPU, but its size is very limited, generally a few kilobytes.

26. Local Memory. Local memory can be used by the work-items within a
work-group. All work-items within a work-group can share data, but data can't be shared
among different work-groups. Physically, local memory is the local data share (LDS)
available on the current generation of GPUs. Each compute unit has its own local
memory, shared among all the processing elements in that compute unit. Local memory
is also extremely fast, although slower than private memory. The size of the LDS ranges
from 16 KB to a maximum of 48 KB on the latest hardware.


Figure 5-2: OpenCL Memory Model

27. Constant Memory. Constant memory is used to store constant data for
read-only access by all of the compute units in the device during the execution of a
kernel. The host processor is responsible for allocating and initializing the memory
objects that reside in this memory space. This is similar to the constant caches
available on GPUs.

28. Global Memory. Global memory can be used by all the compute units on
the device. This is similar to the off-chip DRAM available on GPUs. On the latest GPUs,
GDDR5 DRAM is used as global memory and operates at a bandwidth of 100 GB/s and
above.


OpenCL Program Structure

29. An OpenCL program consists of host code that executes on the host processor
and kernel code that executes on the compute device. A CPU can be used both as host
and compute device.

30. OpenCL Host Code. Host code is a C/C++ program executing on the host
processor to augment the kernel code [22]. A sample host code includes the following
steps:

(a) Create an OpenCL context

(b) Get and select the devices to execute kernel

(c) Create a command queue to accept the execution and memory requests

(d) Allocate OpenCL memory to hold the data for the compute kernel

(e) Online compile and build the compute kernel code

(f) Set up the arguments and execution domain

(g) Kick off compute kernel execution

(h) Collect the results


CHAPTER 6

FAST FOURIER TRANSFORM (FFT)

Introduction

1. The Fourier Transform is a well known and widely used tool in many scientific
and engineering fields. The Fourier transform is essential for many image processing
techniques, including filtering, manipulation, correction, and compression. Fourier
analysis is a family of mathematical techniques, all based on decomposing signals into
sinusoids. The discrete Fourier transform (DFT) is the family member used with
digitized signals and forms the basis of digital signal processing. The FFT is an efficient
algorithm for computing the DFT and its inverse. This chapter provides a brief description
of Fourier analysis and its applications, followed by a detailed description of the DFT and
FFT.

Fourier Transform

2. The Fourier Transform is named after Jean Baptiste Joseph Fourier (1768-1830), a
French mathematician and physicist. Fourier was interested in heat propagation, and
presented a paper in 1807 to the Institut de France on the use of sinusoids to
represent temperature distributions. The paper contained the controversial claim that
any continuous periodic signal could be represented as the sum of properly chosen
sinusoidal waves [23].

3. The Fourier transform converts a signal from the time domain to the frequency
domain by representing it as a sum of properly chosen sinusoids. A time domain signal
is defined by amplitude values at specific time intervals. A frequency domain signal is
defined by the amplitudes and phase shifts of the various sinusoids that make up the
signal.

4. Sinusoidal Fidelity. Sinusoids are used to represent a signal in the
frequency domain because they are the easiest waveforms to work with, due to
sinusoidal fidelity: if a sinusoid enters a linear system, the output is also a sinusoid, at
exactly the same frequency and shape as the input. Only the amplitude and phase can
change [23].

Categories of Fourier Transform

5. The general term Fourier transform can be broken into four categories, resulting
from the four basic types of signals that can be encountered [23].

(a) Aperiodic-Continuous. These are continuous signals that do not
repeat in a periodic fashion. This includes, for example, decaying exponentials
and the Gaussian curve. These signals extend to both positive and negative
infinity without repeating in a periodic pattern. The Fourier Transform for this type
of signal is simply called the Fourier Transform.

(b) Periodic-Continuous. These are continuous signals that repeat
themselves after a fixed time interval. Examples include sine waves, square
waves, and any waveform that repeats itself in a regular pattern from negative to
positive infinity. This version of the Fourier transform is called the Fourier series.

(c) Aperiodic-Discrete. These signals are only defined at discrete
points between positive and negative infinity, and do not repeat themselves in a
periodic fashion. This type of Fourier transform is called the Discrete Time
Fourier Transform (DTFT).

(d) Periodic-Discrete. These are discrete signals that repeat
themselves in a periodic fashion from negative to positive infinity. This class of
Fourier Transform is sometimes called the Discrete Fourier Series, but is most
often called the Discrete Fourier Transform (DFT).

Discrete Fourier Transform (DFT)

6. The DFT is one of the most important algorithms in Digital Signal Processing (DSP).
It converts a periodic-discrete time domain signal to a periodic-discrete frequency domain
signal. As digital computers can only work with information that is discrete and finite in
length, the only Fourier transform that can be used in DSP is the DFT.

7. The input to the DFT is a finite sequence of real or complex numbers, making the
DFT ideal for processing information stored in computers. In particular, the DFT is
widely employed in signal processing and related fields to analyze the frequencies
contained in a sampled signal, to solve partial differential equations, and to perform
other operations such as convolutions or multiplying large integers.

8. Mathematical Complexity. The DFT is defined by the formula shown in
Eq. (6.1).

X(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N) ................ Eq. (6.1)

where N is the total number of input points (samples), X(k) is the DFT of the time
domain signal x(n), j is the imaginary unit, and e^(-j2π/N) is a primitive Nth root of
unity, called the twiddle factor W_N. Evaluating this definition directly requires N²
operations, as there are N outputs X(k) and each output requires a sum of N terms.
Thus the mathematical complexity of the DFT is O(N²) [24].

Fast Fourier Transform (FFT)

9. The FFT was introduced by J.W. Cooley and J.W. Tukey in 1965 [25]. The FFT is a
very efficient method to compute the DFT in O(N log N) operations. It reduces the
operation count by recursively reusing the results of smaller sub-transforms and
eliminating trivial operations such as multiplications by 1. The difference in speed can be
substantial, especially for long data sets where N may be in the thousands or millions. In
practice, the computation time can be reduced by several orders of magnitude in such
cases, and the improvement is roughly proportional to N/log(N). This huge improvement
made many DFT-based algorithms practical.


FFT Algorithms

10. Cooley-Tukey FFT Algorithm. By far the most common FFT is the Cooley–
Tukey algorithm. This is a divide and conquer algorithm that recursively breaks down a
DFT of size N into smaller DFTs; in its radix-2 form it splits a DFT into two pieces of size
N/2 at each step, and is therefore limited to power-of-two sizes. Commonly used variants
of this framework are listed here.

(a) Radix-2

(b) Radix-4

(c) Split Radix

11. Prime-Factor Algorithm (PFA). The PFA, introduced by Good and Thomas [26],
is based on the Chinese Remainder Theorem and factorizes the DFT similarly to
Cooley-Tukey, but without the twiddle factors.

12. Rader-Brenner Algorithm. The Rader-Brenner algorithm [27] is a
Cooley-Tukey-like factorization but with purely imaginary twiddle factors, reducing
multiplications at the cost of increased additions and reduced numerical stability. It was
later superseded by the split-radix variant of Cooley-Tukey, which achieves the same
multiplication count but with fewer additions and without sacrificing accuracy.

Radix-2 FFT Algorithm

13. The Radix-2 FFT algorithm recursively divides the N point input signal into two N/2
point signals at each stage. This recursion occurs until the signal is divided into 2 point
signals, hence the name Radix-2. The input size N must be an integral power of two to
apply this recursion. The FFT is computed in log2(N) stages. The total operation count is
(N/2)·log2(N) complex multiplications and N·log2(N) complex additions [28].

14. The Radix-2 FFT algorithm computes the FFT in the following three steps.

(a) Decompose an N point time domain signal into N time domain signals
each composed of a single point.


(b) Calculate the N frequency spectra corresponding to these N time domain


signals.

(c) Synthesize N spectra into a single frequency spectrum.

Decomposition of Time Domain Signal

15. Interlaced Decomposition. An N point time domain signal is decomposed into
N time domain signals each composed of a single point. Decomposition is done in
stages, using interlaced decomposition at each stage [28]. Interlaced decomposition
breaks the signal into its even and odd numbered samples at each stage. log2(N) stages
are required to decompose an N point signal into N single point signals. Fig. 6-1 shows
how the decomposition works.

16. Bit Reversal Sorting. The interlaced decomposition is done using a bit
reversal sorting algorithm. This algorithm rearranges the order of the N time domain
samples by counting in binary with the bits flipped left-for-right as shown in Fig. 6-2.
This algorithm provides the same output as calculated in Fig. 6-1 using interlaced
decomposition.
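A minimal C sketch of the index computation behind bit reversal sorting is given below; the bit width log2(N) is assumed to be known. (On the GPU, this per-index approach is replaced by the vector method of chapter 7.)

```c
/* Reverse the low `bits` bits of index i: counting in binary with the
   bits flipped left-for-right, as illustrated in Fig. 6-2. */
unsigned bit_reverse(unsigned i, unsigned bits) {
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);   /* shift the lowest bit of i...      */
        i >>= 1;                  /* ...into the next position of r    */
    }
    return r;
}
```

For N = 16 (bits = 4), index 1 (binary 0001) maps to 8 (binary 1000), so sample x[1] moves to position 8 of the reordered signal.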

Figure 6-1: Interlaced Decomposition


Calculating Frequency Spectra

17. The second step in computing the FFT is to calculate the frequency spectra of the
N single point time domain signals. According to the duality principle, the frequency
spectrum of a 1 point time domain signal is equal to itself [28]. A single point in the
frequency domain corresponds to a sinusoid in the time domain; by duality, the inverse
is also true: a single point in the time domain corresponds to a sinusoid in the frequency
domain. Thus this step requires no computation, and each of the 1 point signals is now
a frequency spectrum rather than a time domain signal.

Figure 6-2: Bit Reversal Sorting

Frequency Spectrum Synthesis

18. To synthesize a single frequency spectrum from the N frequency spectra, the N
frequency spectra are combined in the exact reverse order of the time domain
decomposition. The synthesis process is done one stage at a time. For N = 16, in the
first stage, 16 frequency spectra (1 point each) are synthesized into 8 frequency spectra
(2 points each). In the second stage, the 8 frequency spectra (2 points each) are
synthesized into 4 frequency spectra (4 points each), and so on. The last stage results
in the output of the FFT, a 16 point frequency spectrum.

19. Radix-2 FFT Butterfly. The butterfly is the basic computational element of
the FFT, transforming two complex input points into two complex output points. Fig. 6-3
shows a Radix-2 FFT butterfly.

20. For N = 2, the calculations for the FFT are shown in Eq. (6.2) through Eq. (6.4).

X(k) = Σ_{n=0}^{1} x(n) e^(-j2πkn/2) ................ Eq. (6.2)

X(0) = x(0) + x(1) ................ Eq. (6.3)

X(1) = x(0) − x(1) ................ Eq. (6.4)

21. Fig. 6-3 shows the flow graph for these FFT calculations, where W represents the
twiddle factor. The butterfly shown in Fig. 6-3 requires two complex multiplications and
two complex additions.

Figure 6-3: Radix-2 FFT Butterfly

22. This butterfly pattern is repeated over and over to compute the entire frequency
spectrum. The flow graph for an 8 point FFT is shown in Fig. 6-4.


Figure 6-4: FFT Flow Graph (N=8)

Reducing Operations Count

23. The flow graph in Fig. 6-4 depicts an operation count of O(N²). The FFT reduces
the operation count by exploiting the symmetry property of the twiddle factor (W) and
eliminating the trivial multiplications. The general FFT butterfly is shown in Fig. 6-5.

Figure 6-5: Generalized FFT Butterfly

24. Symmetry Property. Applying the symmetry property, defined as Eq. (6.5)
and Eq. (6.6), to the twiddle factors in Fig. 6-5 reduces the complex multiplications by a


factor of two [29]. Fig. 6-6 shows a simplified butterfly with one complex multiplication
and two complex additions.

W_N^(S+N/2) = W_N^S · W_N^(N/2) = −W_N^S ................ Eq. (6.5)

W_N^(N/2) = e^(−jπ) = cos(−π) + j·sin(−π) = −1 ................ Eq. (6.6)

25. Thus the operation count of the FFT is reduced to (N/2)·log2(N) complex
multiplications and N·log2(N) complex additions. The overall complexity is O(N·log2(N)).

Figure 6-6: Simplified FFT Butterfly


CHAPTER 7

OPENCL IMPLEMENTATION

Introduction

1. The previous chapter introduced the Radix-2 FFT algorithm and provided its
mathematical details. This chapter describes the handling of complex numbers and how
the mathematics is implemented to compute the FFT on the GPU using OpenCL.

Data Packaging

2. For generality, a complex-to-complex FFT is implemented. The OpenCL 1.0
specification does not support complex numbers. As there is no special data type for
handling complex numbers, the data needs to be packaged to manage the complex
arithmetic in OpenCL.

3. The real and imaginary parts of the input data are single precision floating point
values. Each complex number can be stored as a two element vector, where each
element is a floating point value. OpenCL supports vector data types with floating point
elements; float2 is a built-in OpenCL data type that stores two floating point values as a
vector. This data type is well suited for handling complex data in OpenCL.

4. The host code copies the real and imaginary input values to a buffer in the global
memory of the GPU. A kernel is then launched to package the floating point real and
imaginary values into the float2 vector data type. The first element of each float2 vector
is the real part and the second element is the corresponding imaginary part.
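The packaging step can be illustrated with a host-side C sketch. The struct below merely mirrors OpenCL's float2 vector type, and the type and function names are illustrative, not part of the OpenCL API; on the GPU, each loop iteration would be one work-item of the packaging kernel:

```c
/* Plain C analogue of OpenCL's float2: .x holds the real part,
   .y holds the imaginary part. */
typedef struct { float x, y; } float2_t;

void package_complex(const float re[], const float im[],
                     float2_t out[], int n) {
    for (int i = 0; i < n; i++) {   /* one work-item per element on the GPU */
        out[i].x = re[i];
        out[i].y = im[i];
    }
}
```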

1D FFT Implementation

5. As discussed in chapter 6, the Radix-2 algorithm computes the FFT in two major
steps.


(a) Data Decomposition (Bit Reversal Sorting)

(b) Butterfly Computations (Frequency spectrum synthesis)

Data Decomposition

6. Data decomposition in the FFT is achieved using a bit reversal algorithm.
Computing the bit reversal permutation efficiently is a problem on its own. In general, bit
reversal methods are divided into two main classes: in-place bit reversals and indirect
addressing methods [30]. The first rearrange the input vector x into its bit reversal order,
normally through nested sequences of stride permutations. This method is not efficient
for parallel architectures due to the large amount of branching.

7. Indirect addressing methods, in turn, do not reorder x but instead compute a
vector representation of the bit reversal permutation. For example, for N = 8, the vector
representation will be as shown in Fig. 7-1.

Figure 7-1: Bit Reversed Vector

8. Elster Algorithm. One of the most efficient methods for producing a vector
representation of the bit reversal is Elster's algorithm [31]. The purpose of this algorithm
is to create a vector B with the bit reversal permutation values. Elster computes the
N-point bit reversal vector in log2(N) steps. For example, for N = 8 the vector
representation is built as shown in Table 7-1.


Table 7-1: Elster Algorithm Calculations

    Step      Vector B
    Initial   0
    +4        0 4
    +2        0 4 2 6
    +1        0 4 2 6 1 5 3 7
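The doubling steps of Table 7-1 translate directly into C. This is a sequential sketch; B must have room for n entries and n is assumed to be a power of two:

```c
/* Elster's method: build the n-point bit-reversal vector B in log2(n)
   steps. Start with {0}; at each step, append a copy of the current
   vector with the current increment added, then halve the increment. */
void elster(unsigned B[], unsigned n) {
    B[0] = 0;
    unsigned len = 1;
    for (unsigned inc = n / 2; inc >= 1; inc /= 2) {
        for (unsigned k = 0; k < len; k++)
            B[len + k] = B[k] + inc;   /* e.g. {0,4} -> {0,4,2,6} */
        len *= 2;
    }
}
```

For n = 8 this reproduces the final row of Table 7-1: {0, 4, 2, 6, 1, 5, 3, 7}.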

Parallel Implementation of Elster Algorithm

9. The Elster method can be parallelized by dividing the total length N by the number
of parallel processes and calculating each block on a separate processing element.
Assuming each block size to be M, the total number of blocks formed is N/M. Each block
can be computed in parallel on a separate processing element. As GPUs have a large
number of processing elements, this implementation maps well to the GPU architecture.

10. Initial point of each block “head” is pre calculated by applying Elster on Total
blocks. Fig. 7-2 shows an example of Elster on 16 elements. Head is computed by bit
reversing 4 element vector [0 1 2 3]. Applying Elster on this vector results in [0 2 1 3].
The four blocks are computed in parallel on separate processing element, thus
speeding up the calculations.
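The block decomposition above can be sketched as follows. Each iteration of the outer loop is independent and corresponds to the work one GPU processing element would do; this is an illustrative Python sketch under that assumption, not the thesis's kernel code:

```python
def elster_bit_reversal(n):
    # Elster's construction of the n-point bit reversal vector
    b = [0] * n
    length, add = 1, n // 2
    while length < n:
        for i in range(length):
            b[i + length] = b[i] + add
        length *= 2
        add //= 2
    return b

def blockwise_bit_reversal(n, num_blocks):
    """Bit reversal of n points split into num_blocks independent blocks."""
    m = n // num_blocks                      # block size M
    heads = elster_bit_reversal(num_blocks)  # pre-calculated block heads
    # offsets within a block: bit reversal of M indices, scaled by the block count
    offsets = [v * num_blocks for v in elster_bit_reversal(m)]
    out = []
    for head in heads:                       # each block -> one processing element
        out.extend(head + off for off in offsets)
    return out
```

For n = 16 with four blocks the heads are [0, 2, 1, 3], matching Fig. 7-2, and the concatenated blocks equal the full 16-point bit reversal vector.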

Butterfly Computations

11. The FFT butterfly is derived by expanding the formula of the DFT shown as Eq. (7.1).

X_k = Σ_{n=0}^{N−1} x_n · e^{−j(2πkn)/N} ………….. Eq. (7.1)

12. After bit reversal sorting of the input data, one stage of 2-point butterflies can be launched in parallel. Fig. 7-3 shows the simplified FFT butterfly. This butterfly comprises one complex multiplication, two complex additions and a sign change.
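The simplified butterfly of Fig. 7-3 amounts to one complex multiplication and two complex additions, the sign change giving the subtraction. A minimal host-side sketch:

```python
def butterfly(a, b, w):
    """2-point FFT butterfly: one complex multiply, two complex adds."""
    t = w * b             # the single complex multiplication
    return a + t, a - t   # upper and lower outputs (sign change on the lower path)
```

With w = 1 this reduces to the 2-point DFT: butterfly(1, 1, 1) gives (2, 0).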


Figure 7-2: Applying Elster on 16 Elements

13. Simple Program Structure. To calculate the FFT of N points, the simplest approach is to bit reverse the input data and launch 2-point FFT kernels at each stage to calculate the entire frequency spectrum. This approach requires log₂(N) kernel launches, with the input stride increasing by a factor of two at each kernel launch.
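The staged approach (bit reversal, then log₂N passes of 2-point butterflies with the stride doubling each pass) can be mimicked on the host as follows. This is an illustrative Python sketch of the control flow, not the OpenCL kernels themselves:

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 FFT: bit-reverse, then log2(N) butterfly stages."""
    n = len(x)
    y = list(x)
    # bit reversal sorting of the input data
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            y[i], y[j] = y[j], y[i]
    # one "kernel launch" per stage; the input stride doubles every stage
    length = 2
    while length <= n:
        w_len = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(length // 2):
                a, b = y[start + k], y[start + k + length // 2] * w
                y[start + k], y[start + k + length // 2] = a + b, a - b
                w *= w_len
        length *= 2
    return y
```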

14. Limitations. The 2-point FFT butterfly requires only 25 ALU operations, while each thread of the 2-point FFT kernel reads 4 floating point values (16 bytes). The ALU to Fetch ratio, a kernel performance parameter, is the ratio of the time taken by ALU operations to the time spent importing the data; for high kernel performance this ratio should be high. Here it is very low, degrading kernel performance. Doing more calculations per thread can improve it.


Figure 7-3: Simplified FFT Butterfly

Improved Program Structure

15. The ALU to Fetch ratio can be improved by calculating 2 or more FFT stages in a
single kernel launch. This approach reduces the total number of kernel launches, thus
reducing the overheads incurred in issuing a kernel call.

16. Hard-Coded FFT Kernels. FFT kernels for 2, 4 and 8 points are hard-coded in the program. The code is developed by expanding the DFT formula for N = 2, 4 and 8. A 16-point FFT kernel is not hard-coded because it would consume more register file space per thread, reducing the total number of concurrent threads and degrading performance.

17. Twiddle Factor Calculation. The twiddle factor (W) for each FFT stage is calculated on the fly. The mathematical equation for calculating the twiddle factor is derived as follows.

W_N = e^{−j(2πn)/N} ……………… Eq. (7.2)

e^{−jθ} = cos θ − j sin θ …………. Eq. (7.3)

18. Applying the identity in Eq. (7.3) to Eq. (7.2) yields Eq. (7.4), the final equation used to calculate the twiddle factor in the kernel.

W_N = cos(2πn/N) − j sin(2πn/N) ………….. Eq. (7.4)
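Eq. (7.4) translates directly into a per-thread computation. A sketch follows; the actual kernel would use the OpenCL built-ins, and ordinary Python math is used here only for illustration:

```python
import math

def twiddle(n, big_n):
    """Twiddle factor W_N = cos(2*pi*n/N) - j*sin(2*pi*n/N), per Eq. (7.4)."""
    angle = 2.0 * math.pi * n / big_n
    return complex(math.cos(angle), -math.sin(angle))
```

For example, twiddle(2, 8) gives −j and twiddle(4, 8) gives −1, the familiar 8-point twiddle values.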



19. Computing Large FFT. FFTs of length 16 and above are computed by invoking a set of hard-coded, small-length FFT kernels. A 1024-point FFT requires 10 stages and is calculated by launching the 8-point FFT kernel three times, completing 9 stages, followed by one 2-point FFT kernel for the 10th stage. The largest FFT that can be computed using this implementation is about 8 million (2²³) points.
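The decomposition into hard-coded kernel launches can be sketched as a small planner. The function name is hypothetical, not from the thesis code; it simply greedily prefers the 8-point kernel (3 stages per launch):

```python
def kernel_plan(n):
    """Split the log2(n) FFT stages into 8-, 4- and 2-point kernel launches."""
    stages = n.bit_length() - 1   # log2(n) for a power of two
    plan = []
    while stages >= 3:            # 8-point kernel covers 3 stages per launch
        plan.append(8)
        stages -= 3
    if stages == 2:
        plan.append(4)
    elif stages == 1:
        plan.append(2)
    return plan
```

kernel_plan(1024) yields [8, 8, 8, 2]: three 8-point launches (9 stages) plus one 2-point launch for the 10th stage, exactly as described above.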

2D FFT Implementation

20. The 2D FFT is implemented by computing a 1D FFT on the rows followed by a 1D FFT on the columns of the input matrix. The 2D input matrix is stored in row major order in computer memory; a matrix is converted to row major order by placing each successive row next to the previous one in a 1D array.

21. For a GPU, computing the FFT on columns is very expensive due to the large input strides when accessing the data. A modified technique introduces a matrix transpose stage after the 1D FFT on rows, so the 2D FFT is calculated in the following steps:

(a) 1D FFT on rows of matrix

(b) Transpose the matrix

(c) 1D FFT on rows of transposed matrix

(d) Transpose the matrix to restore natural order
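The four steps above can be written down directly. A NumPy sketch follows for illustration; the thesis implements each step as OpenCL kernels, whereas here the 1D row FFTs and transposes are host-side calls:

```python
import numpy as np

def fft2_via_transpose(a):
    a = np.fft.fft(a, axis=1)   # (a) 1D FFT on rows
    a = a.T                     # (b) transpose
    a = np.fft.fft(a, axis=1)   # (c) 1D FFT on rows of the transposed matrix
    return a.T                  # (d) transpose back to natural order
```

The result matches a direct 2D FFT of the same input.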

Matrix Transpose Implementation

22. Matrix Transpose. The transpose of a matrix A is another matrix Aᵗ in which each row has been interchanged with the respective column; the transpose thus swaps the elements along the axes of the matrix, as depicted in Fig. 7-4. Mathematically, the transpose of a matrix A is defined by Eq. (7.5), where i and j are the indices along the X and Y axes respectively.


(Aᵗ)_{i,j} = A_{j,i} ………… Eq. (7.5)

23. Simple Transpose Kernel. In a simple transpose kernel each thread reads
one element of the input matrix from global memory and writes back the same element
at its transposed index in the global memory. Fig. 7-5 shows a 4x4 matrix with index of
each element.

Figure 7-4: Matrix Transpose


Figure 7-5: 4x4 Matrix Transpose Example

24. Limitations of Simple Transpose Kernel. The result of applying the simple matrix transpose kernel to a 4x4 matrix is shown in Fig. 7-6. All threads read consecutive elements from global memory, so the reads are coalesced. Coalesced loads are the most efficient way of accessing global memory because the data is loaded as a single chunk, and all global memory accesses should be coalesced for optimal performance. The writes, however, are non-coalesced, since the threads are not writing consecutive elements. Each non-coalesced access to global memory is serviced individually and multiple accesses are serialized, causing long latency [32].

Figure 7-6: Result of Simple Matrix Transpose Kernel


25. The simple transpose kernel is therefore inefficient due to its non-coalesced memory accesses and needs modification to coalesce all memory accesses. This can be achieved using the local memory available on the GPU.

Matrix Transpose Using Local Memory

26. Local memory, or LDS, is an extremely fast memory available on GPUs, ideally providing data accesses with near-zero latency. Each SIMD engine on the ATI 5870 owns 32 KB of LDS. Data in LDS can be accessed in any pattern without performance penalty as long as there are no bank conflicts [32], which occur when two or more threads simultaneously access different addresses that map to the same LDS bank. Using LDS in the transpose kernel can improve performance by a large factor.

27. The improved transpose kernel reads the input data from global memory in a coalesced fashion and writes it into local memory. Fig. 7-7 shows this coalesced data transfer from global to local memory.

Figure 7-7: Coalesced Data Transfer from Global to Local Memory

28. Once the data is loaded in local memory, each thread can access the transposed data element and write it out to global memory in a coalesced fashion. Fig. 7-8 shows how consecutive threads write the transposed elements to global memory.
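The local-memory scheme of Figs. 7-7 and 7-8 can be simulated on the host: each work-group stages one tile in "LDS", then writes the tile's transpose to the mirrored block position, so both the global reads and the global writes walk consecutive elements. A Python sketch (tile size and names are illustrative):

```python
def tiled_transpose(mat, tile=2):
    """Transpose a square matrix tile by tile, mimicking the LDS scheme."""
    n = len(mat)
    out = [[0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            # stage the tile in "local memory" (coalesced row-by-row reads)
            lds = [row[bj:bj + tile] for row in mat[bi:bi + tile]]
            # write the transposed tile to the swapped block (coalesced writes)
            for i in range(len(lds)):
                for j in range(len(lds[i])):
                    out[bj + j][bi + i] = lds[i][j]
    return out
```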


Figure 7-8: Writing Transposed Elements to Global Memory

29. The use of local memory in the matrix transpose kernel provides a speedup of over 20 times compared to the un-optimized kernel.

CHAPTER 8

RESULTS AND CONCLUSION

1. OpenCL code for 1D and 2D FFT has been developed using the methodology discussed in the previous chapter. This chapter summarizes the results of the 1D and 2D FFT implementation and its benchmarks against high-end Intel CPUs.

Computing Environment

2. Hardware. The GPU performance is benchmarked against two high-end Intel CPUs. The hardware details are as follows:

(a) GPU. ATI Radeon HD 5870

(b) CPU-1. Core-i7 930 @ 2.80 GHz

(c) CPU-2. Core 2 Quad 8200 @ 2.33 GHz

3. Software. The operating system installed on both CPUs is Windows 7 64-bit, and CPU performance results are obtained on MATLAB R2009b.

Experiment Setup

4. CPU performance results are obtained by computing the FFT of random values in MATLAB and measuring the execution time using the commands tic and toc. All CPU times presented in Table 8-1 and Table 8-2 are averaged over 50 iterations.


5. The GPU FFT time is measured using the ATI Stream Profiler, a tool provided by AMD for accurate performance measurement; it can measure time with 1 nanosecond precision [33].

1D FFT Results

6. This FFT implementation supports power-of-two transform sizes up to about 8 million (2²³) points. Table 8-1 summarizes the performance comparison over the input size range. FFT time is the kernel execution time of the FFT; the total time also includes the memory transfer time between CPU and GPU over the PCIe bus.

Table 8-1: Performance Benchmark for 1D FFT

                 ATI Radeon 5870         CPU 1 (Core-i7 930)  CPU 2 (Core 2 Quad)
Input Size       FFT Time    Total Time  FFT Time             FFT Time
(x1000 points)   (ms)        (ms)        (ms)                 (ms)
1                0.04398     0.5037      0.0064               0.0357
4                0.05753     0.5077      0.0264               0.1592
16               0.06822     0.6446      0.1125               0.7829
64               0.13872     0.6501      0.4666               3.3158
256              0.42907     1.36833     3.1224               18.278
1024             1.77403     7.19109     17.036               87.589
4096             7.69335     33.5992     134.728              548.41
8192             15.62233    68.87037    273.503              1192.2


2D FFT Results

7. The largest 2D FFT input size supported by this implementation is 2048x2048 (2²²) points. Table 8-2 summarizes the performance comparison over the input size range. FFT time is the kernel execution time of the FFT; the total time also includes the memory transfer time between CPU and GPU over the PCIe bus.

Table 8-2: Performance Benchmark for 2D FFT

                 ATI Radeon 5870         CPU 1 (Core-i7 930)  CPU 2 (Core 2 Quad)
Input Size       FFT Time    Total Time  FFT Time             FFT Time
(Points)         (ms)        (ms)        (ms)                 (ms)
64x64            0.07556     0.41039     0.085                0.2982
128x128          0.11226     0.55073     0.243                1.3016
256x256          0.20268     0.68671     0.965                5.6619
512x512          0.63373     1.55834     4.288                28.576
1024x1024        3.81680     9.27949     11.395               155.24
2048x2048        27.4293     54.0063     72.13                648.38

Analysis of Results

8. The results in Table 8-1 and Table 8-2 show that the OpenCL FFT implemented on the ATI 5870 GPU outperforms state-of-the-art Intel CPUs. The performance advantage grows with the input size, because larger inputs expose more parallelism. The data transfer time between the GPU and the motherboard is a factor that limits the performance gains.

9. GPU Bottleneck. The GPU is connected to the motherboard through a PCIe bus with a memory bandwidth of only 5.2 GB/s. Data transfer between CPU and GPU is therefore very expensive, as is evident from the performance results presented in Tables 8-1 and 8-2.

10. AMD Fusion Project. AMD has recently launched a new compute architecture, named the Accelerated Processing Unit (APU), having both the CPU and the GPU on a single chip [34]. Placing both on one chip removes the PCIe interface and with it the data transfer bottleneck.

Conclusion

11. The GPU is a novel parallel computing architecture providing huge speedups for data-parallel computations at a relatively low monetary cost. The performance gains depend upon the exploitable parallelism in the algorithm being implemented.

12. OpenCL is a heterogeneous parallel computing environment that allows developers to tap into the immense computing power of GPUs for general purpose computations, while providing code portability across a range of compute devices.

13. The OpenCL implementation of the FFT on the ATI 5870 GPU outperforms the latest Intel CPUs by exploiting the data-level parallelism inherent in the FFT algorithm. Both 1D and 2D FFT algorithms have been implemented successfully.

14. Currently, data transfer to the GPU over the PCIe interface presents a performance bottleneck. The AMD Fusion APUs remove this bottleneck, opening new horizons for parallel computing on GPUs.


REFERENCES

[1] http://gpgpu.org/about
[2] http://www.streamcomputing.nl/blog/difference-between-cuda-and-opencl
[3] http://www.khronos.org/opencl
[4] K. Moreland and E. Angel, "The FFT on a GPU," in Proceedings of the ACM SIGGRAPH Conference on Graphics Hardware, 2003, pp. 112–119.
[5] J. Spitzer, "Implementing a GPU-efficient FFT," SIGGRAPH Course on Interactive Geometric and Scientific Computations with Graphics Hardware, 2003.
[6] David Kirk/NVIDIA and Wen-mei Hwu, "CUDA Programming Book."
[7] http://developer.download.nvidia.com/compute/CUFFT_Library_0.8.pdf
[8] http://developer.download.nvidia.com/compute/CUFFT_Library_3.0.pdf
[9] http://developer.apple.com/mac/library/samplecode/OpenCL_FFT.html
[10] J. Kong et al., "Accelerating MATLAB Image Processing Toolbox Functions on GPUs," GPGPU-3, March 2010.
[11] http://ixbtlabs.com/articles3/video/cypress-p2.html
[12] http://en.wikipedia.org/wiki/Graphics_processing_unit
[13] http://www.videocardbenchmark.net/30dayshare.html
[14] Jimmy Pettersson and Ian Wainwright, "Radar Signal Processing with GPUs," SAAB, Master Thesis, 2010.
[15] http://www.hpcwire.com/features/Compilers_and_More_GPU_Architecture_and_Applications
[16] http://en.wikipedia.org/wiki/Flynn's_taxonomy
[17] Timothy G. Mattson and Beverly A. Sanders, "Patterns for Parallel Programming," p. 15.
[18] "ATI Stream SDK - OpenCL Programming Guide," Advanced Micro Devices, 2010.
[19] "OpenCL™ and the ATI Radeon™ HD 5870 Architecture," Advanced Micro Devices.
[20] "OpenCL Technology Brief," Apple Inc.
[21] www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf
[22] "An Introduction to OpenCL™," www.amd.com.
[23] Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing," Ch. 8, pp. 140–146, 1997.
[24] Douglas L. Jones and Ivan Selesnick, "The DFT, FFT, and Practical Spectral Analysis," p. 42, Rice University, Houston, Texas, 2007.
[25] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19: 297–301, 1965.
[26] I. J. Good and L. H. Thomas, "Using a computer to solve problems in physics," in Applications of Digital Computers, Boston, 1963.
[27] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," IEEE Proceedings, 1968, pp. 1107–1108.
[28] Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing," Ch. 12, pp. 225–235, 1997.
[29] http://www.cmlab.csie.ntu.edu.
[30] Dániza C. Morales Berrios, "A Parallel Bit Reversal Algorithm and its CILK Implementation," University of Puerto Rico, Mayagüez Campus, 1999.
[31] A. C. Elster, "Fast Bit-Reversal Algorithm," ICASSP '89 Proceedings, 1989.
[32] David W. Gohara, "OpenCL Memory Layout and Access," Washington University School of Medicine, St. Louis, September 2009.
[33] André Heidekrüger, Sr. System Engineer Graphics, "AMD/ATI Stream Computing on GPU," T-Systems HPCN-Workshop, 2010.
[34] http://www.fusion.amd.com

