
Benchmarking the NVIDIA 8800GTX with the CUDA Development Platform

Michael P. McGraw-Herdeg, Massachusetts Institute of Technology, [email protected]


Douglas P. Enright, The Aerospace Corporation, [email protected]
B. Scott Michel, The Aerospace Corporation, [email protected]

Abstract

Two HPEC Challenge benchmarks, finite impulse response filtering and QR decomposition, were implemented on an NVIDIA 8800 GTX graphics card using a data-parallel implementation approach. For the finite impulse response filter bank benchmark, a fast-convolution, FFT-based frequency-domain approach on the GPU performed 4 to 35 times faster than the comparable calculation on a CPU. A non-transform time-domain approach outperformed the comparable CPU calculation by a factor of 1.6 to 15. When computing the QR decomposition of a complex matrix, GPU computations are consistently 2.5 times faster than the CPU. All of these parallel algorithms were written in NVIDIA's Compute Unified Device Architecture (CUDA), a C interface that provides quick, effective parallelization.
Hardware and Software

The NVIDIA 8800 GTX video card has 16 multiprocessors, each composed of 8 SIMD processors operating at 1350 MHz [1]. Each multiprocessor has 8192 registers, a 16KB parallel data cache of fast “shared memory,” and access to 768 MB of GDDR3 “global memory.” The card is used most efficiently in a data-parallel fashion, when the ratio of computations to memory accesses is high and when many computations are performed concurrently.

The benchmarks were implemented using NVIDIA's CUDA SDK, which is a collection of C extensions and a runtime library. CUDA's functionality primarily allows a developer to write C functions to be executed on the GPU. CUDA also includes memory management and execution configuration; with CUDA, a developer can control the number of GPU processors and threads that are invoked during a function's execution.

The test system, running Gentoo Linux, contained a dual-core Athlon 64 4200+ running at 2210 MHz, 2GB of memory, and a PCI Express x16 bus. Each Athlon core had 128KB of Level 1 cache and 512KB of Level 2 cache. The code was compiled with gcc v4.0.4 and nvcc v0.2.1221.
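As a concrete illustration of this execution-configuration model, the sketch below (hypothetical code, not taken from the benchmarks) marks a function for execution on the card and launches it over a grid of thread blocks:

    // Hypothetical example: scale a complex vector on the GPU.
    // __global__ marks a function that executes on the card; each of
    // the 64 * 256 threads launched below handles one element.
    __global__ void scale(float2 *v, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            v[i].x *= a;  // real part
            v[i].y *= a;  // imaginary part
        }
    }

    // Host side: the <<<blocks, threads>>> brackets are the execution
    // configuration; a single line invokes the device function.
    scale<<<64, 256>>>(d_v, 2.0f, n);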
Finite Impulse Response: Benchmark Overview

The FIR benchmark models a set of M filters which operate on a set of M distinct input signals of length N. Each filter has K coefficients. Signal and filter elements are complex, single-precision floating-point numbers. The output of filter m ∈ {0, 1, ..., M-1} is given by the convolution

    y_m[i] = \sum_{k=0}^{K-1} x_m[i-k] \, w_m[k], \quad i = 0, 1, \ldots, N-1        (1)

A time-domain implementation of the FIR filter computes this convolution directly and uses 8*M*N*K floating-point operations. A frequency-domain approach is often preferred since convolution in the time domain is multiplication in the frequency domain, and consequently the computation time does not depend on filter size. This approach computes the FFT of the signal and filter, multiplies the transformed signal and filter, then inverts the transformation. It requires M(10*N*log2(N) + 8*N) operations [2].
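The operation count can be read off directly from a reference statement of Eq. (1); in the sketch below (hypothetical code, not from the benchmark), each tap costs one complex multiply (six real operations) plus one complex accumulate (two), giving 8 operations per (i, k) pair and hence 8*M*N*K in total:

    /* Hypothetical reference code: a direct statement of Eq. (1)
     * for a single filter, with x[i-k] taken as zero for i-k < 0. */
    void fir_reference(const float2 *x, const float2 *w, float2 *y,
                       int N, int K)
    {
        for (int i = 0; i < N; i++) {
            float2 acc = {0.0f, 0.0f};
            for (int k = 0; k <= i && k < K; k++) {
                acc.x += x[i-k].x * w[k].x - x[i-k].y * w[k].y;  /* complex multiply, */
                acc.y += x[i-k].x * w[k].y + x[i-k].y * w[k].x;  /* then accumulate   */
            }
            y[i] = acc;
        }
    }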
Five sets of test data were chosen, with the first two taken from the HPEC benchmark; the parameters are given in Table 1.

Table 1: FIR Test Parameters

         Test 1   Test 2   Test 3   Test 4   Test 5
    N      4096     1024     4096     4096    32768
    K       128       12     4096      128     4096
    M        64       20      128      512      128

The FIR filter was implemented on the card in three ways. The first, a series approach, performs the frequency-domain task one signal at a time, using the card's FFT capability as accessed through NVIDIA's CUFFT library. The second, a parallel approach, uses the CUFFT library's “batch mode” to process all the signals at once, as sketched below. The third, a time-domain filter, performs the convolution directly; it uses NVIDIA's CUBLAS library to dispatch each of the required convolution calculations as a Level 1 BLAS caxpy operation.
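The batch-mode path might be structured as in the following sketch (hypothetical code with error checking omitted; d_x, d_X, and d_y are assumed device buffers of M*N cufftComplex elements): one CUFFT plan transforms all M signals at once.

    #include <cufft.h>

    /* Hypothetical sketch of the parallel frequency-domain approach. */
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, M);          /* M transforms of size N */
    cufftExecC2C(plan, d_x, d_X, CUFFT_FORWARD);  /* all signals at once    */
    /* ... multiply d_X element-wise by the transformed filters
     * with a small kernel (not shown) ... */
    cufftExecC2C(plan, d_X, d_y, CUFFT_INVERSE);  /* back to the time domain */
    cufftDestroy(plan);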
FIR Results

To conserve space, only CPU time and the ratio of CPU time to GPU time are reported; the GPU performs faster than the CPU when the reported ratio exceeds one. For the CPU, the time recorded in Tables 2 and 3 is that reported by the HPEC benchmark. For the GPU, the “total ratio” reported in Tables 2 and 3 accounts for both GPU computation time and the cost of moving the input and output data between the host computer and the card. The time required to perform the computations alone, with the required data already placed in the card's global memory, is reported as the “calculation ratio” in Table 2.

Table 2: FIR Frequency-Domain Test Results

                                        Test 1   Test 2   Test 3   Test 4   Test 5
    CPU time (s)                          0.12   0.0080     0.24     0.95      2.4
    GPU series calculation (ratio)         1.8     0.50      1.6      1.8       16
    GPU series total (ratio)               1.5     0.43      1.3      1.5      9.2
    GPU parallel calculation (ratio)        26      4.4       35       24       17
    GPU parallel total (ratio)              14      3.3       13       13       12

Table 3: FIR Time-Domain Test Results

                         Test 1   Test 2   Test 3   Test 4   Test 5
    CPU time (s)           0.71   0.0052       45      5.6      375
    GPU total (ratio)       6.9      1.6      7.4      7.2       15
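The two ratios correspond to different measurement windows; a hypothetical sketch of the timing (the paper does not show its harness, and run_fir_on_gpu, d_in, d_out, h_in, and h_out are stand-ins) makes the distinction concrete:

    #include <sys/time.h>

    /* Wall-clock helper; kernel launches are asynchronous, so
     * cudaThreadSynchronize() must run before the clock is read. */
    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    double t0 = now();
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    /* host -> card */
    double t1 = now();
    run_fir_on_gpu(d_in, d_out);                              /* hypothetical driver */
    cudaThreadSynchronize();
    double t2 = now();
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  /* card -> host */
    double t3 = now();
    /* "calculation ratio" compares CPU time to t2 - t1;
     * "total ratio" compares CPU time to t3 - t0. */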

Parallel computations consistently outperform the CPU, although by a smaller factor in test 2, the short-filter case, where most of the data fits in the CPU cache. Series GPU computations underperform the CPU only in test 2. In test 5, where the problem size is very large, the series algorithm's calculations take only slightly longer than the parallel algorithm's, though the total series time remains somewhat slower than the total parallel time.

For the time-domain benchmark, Table 3 reports CPU time and the ratio of total CPU time to GPU time (in each time-domain test, total GPU time was no more than 7% greater than calculation time). GPU time-domain computation significantly outperforms the CPU on large data sets, and it is competitive on the short-filter test case. It is not competitive with GPU frequency-domain filtering. In limited testing, GPU time-domain filtering outperforms the parallel GPU frequency-domain filter when M and N are large and K is small, for instance by a factor of three with N=4096, K=12, M=1000.
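The caxpy dispatch named above can be realized as in the following sketch (hypothetical code for a single filter; d_x and d_y are card-resident buffers, d_y zeroed beforehand, and h_w holds the K filter taps on the host): tap k accumulates w[k] times the signal into the output at offset k.

    #include <cublas.h>

    /* Hypothetical sketch: time-domain convolution as K Level 1 BLAS
     * caxpy calls. Tap k performs y[k+j] += w[k] * x[j], j = 0..N-k-1,
     * which together reproduce Eq. (1). */
    for (int k = 0; k < K; k++) {
        cuComplex wk = make_cuFloatComplex(h_w[k].x, h_w[k].y);
        cublasCaxpy(N - k, wk, d_x, 1, d_y + k, 1);
    }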
QR Decomposition: Benchmark Overview

In the QR benchmark, an m×n matrix A is factorized into an m×m unitary matrix Q and an upper triangular matrix R. The matrices A, Q, and R contain complex, single-precision floating-point numbers. The following properties hold after QR factorization:

    A = QR, \quad Q^H Q = I        (2)

The HPEC reference implementation performs QR via Givens rotations. A Givens rotation selectively zeroes an element of the target matrix A by updating two of its rows. In the Fast Givens QR algorithm [3], the rotations necessary to triangularize A into R are directly accumulated into Q.

The parallel approach used on the GPU employs Givens rotations in the standard Sameh-Kuck concurrency strategy [4]. This pattern concurrently zeroes elements that are a knight's move apart; see figure 1 of [4]. A sketch of the resulting stage schedule follows Table 4.

Seven sets of test data are parametrized in Table 4; the first three sets follow the HPEC QR benchmark.

Table 4: QR Test Parameters

         Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7
    M       500      180      150     1000     1000     2000     2000
    N       100       60      150      500     1000     1000     2000
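The schedule can be sketched as follows (hypothetical host-side code: the stage formula is the standard Sameh-Kuck ordering, and apply_givens_stage, d_A, d_rot, rot, and THREADS are stand-ins for the kernel and buffers, allocations not shown). Within a stage the selected entries touch disjoint row pairs, so one kernel launch applies all of that stage's rotations concurrently:

    /* Hypothetical sketch of the Sameh-Kuck schedule (figure 1 of [4]).
     * Entry (i,j), counting from 1, is zeroed at stage m - i + 2j - 1
     * by rotating rows i-1 and i; entries sharing a stage sit a
     * knight's move apart and affect disjoint row pairs. */
    for (int stage = 1; stage <= m + n - 2; stage++) {
        int count = 0;
        for (int j = 1; j <= n; j++) {
            int i = m + 2*j - 1 - stage;         /* entry (i,j) zeroed now */
            if (i > j && i <= m)
                rot[count++] = make_int2(i, j);
        }
        if (count > 0) {
            cudaMemcpy(d_rot, rot, count * sizeof(int2), cudaMemcpyHostToDevice);
            apply_givens_stage<<<count, THREADS>>>(d_A, d_rot, count, n);
        }
    }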
QR Results

Table 5: QR Test Results

                         Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7
    CPU time (s)           0.79    0.062    0.087       26       65      120      209
    GPU total (ratio)       1.7     0.90     0.81      2.6      2.6      2.4      2.6

CPU time and the ratio of CPU time to GPU time are reported. GPU time includes the cost of copying memory between card and host; this is less than 1% of the total time. On data sets much larger than the CPU's L1 cache, the GPU consistently outperforms the CPU by a factor of about 2.5.

Performance asymptotically twice as fast as Sameh-Kuck should be attainable via pipelining [5]. However, this approach was not implemented; the graphics card provides limited thread synchronization and no atomicity (atomic operations on integers exist in CUDA Compute Capability 1.1 on the newer but slower GeForce 8600 and 8500 cards; atomic operations on floating-point numbers are not supported in any existing hardware, though they are referenced in the card's high-level PTX assembler language), so an effective pipelining approach would reduce the ratio of computations to memory access.
CUDA Programmability

The CUDA paradigm, in which the GPU is a SIMD processor array, makes an efficient tradeoff between general-purpose and specialized computation. A single line of code invokes a device function with a specified thread organization, as in the launch sketch shown earlier. The card tightly interleaves these threads' computations and memory accesses. There is little complexity overhead: the GPU benchmarks have roughly as many lines of code as their HPEC Challenge counterparts.

Data-parallel design for CUDA follows well-understood SIMD patterns. The key challenges of the architecture are structuring expensive memory accesses appropriately and avoiding complicated synchronization requirements. The two NVIDIA-supplied libraries (CUFFT, for fast Fourier transforms, and CUBLAS, a set of basic linear algebra subroutines) were readily adapted to FIR and QR respectively. CUFFT peak performance of 35 Gflop/s was measured in direct tests, but these frequency-domain results represent at best 10 Gflop/s. CUBLAS is incomplete; the library implements some operations, such as Givens rotations, for real numbers but not for complex numbers. These shortcomings suggest avenues for future exploration. Nevertheless, these libraries are, like CUDA itself, tremendously convenient; they easily and effectively exploit the GPU's parallel capabilities.

References

[1] NVIDIA Corporation, “NVIDIA CUDA Compute Unified Device Architecture Programming Guide,” Version 1.0, 23 June 2007.
[2] J. Lebak, A. Reuther, and E. Wong, “Polymorphous Computing Architecture (PCA) Kernel-Level Benchmarks,” MIT Lincoln Laboratory project report PCA-KERNEL-1, 13 June 2005.
[3] G. H. Golub and C. F. Van Loan, Matrix Computations, Third Edition, Johns Hopkins University Press, 1996.
[4] A. H. Sameh and D. J. Kuck, “On Stable Parallel Linear System Solvers,” Journal of the ACM, Vol. 25, No. 1, Jan. 1978, pp. 81-91.
[5] M. Hofmann and E. J. Kontoghiorghes, “Pipeline Givens Sequences for Computing the QR Decomposition on an EREW PRAM,” Parallel Computing, Vol. 32, No. 3, March 2006.
