Benchmarking The NVIDIA 8800GTX With The CUDA Development Platform
Parallel computations consistently outperform the CPU, although by a smaller factor in test 2, the short-filter case, where most of the data fits in CPU cache. Series GPU computations underperform the CPU only in test 2. In test 5, where the problem size is very large, the series algorithm's calculations take only slightly longer than the parallel algorithm's, though the total series time remains somewhat slower than the total parallel time.
For the time-domain benchmark, Table 3 reports CPU time and the ratio of total CPU time to GPU time¹. GPU time-domain computation significantly outperforms the CPU on large data sets, and it is competitive on the short-filter test case. It is not competitive with GPU frequency-domain filtering. In limited testing, GPU time-domain filtering outperforms the parallel GPU frequency-domain filter with large M and N and small K, for instance by a factor of three with N=4096, K=12, M=1000.
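Time-domain FIR filtering, the computation behind these figures, is a direct convolution of an N-sample input with a K-tap filter. The following minimal CPU sketch in C is illustrative only, with our own names, shown real-valued for brevity; it is not the benchmark or GPU implementation:

```c
#include <stddef.h>

/* Direct time-domain FIR: y[n] = sum over k of h[k] * x[n-k].
 * N is the input length and K the filter length; samples before
 * the start of x are treated as zero. */
static void fir_time_domain(const float *x, size_t N,
                            const float *h, size_t K,
                            float *y)
{
    for (size_t n = 0; n < N; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < K && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

Each output costs O(K) work, O(N·K) in total; for large K the frequency-domain approach replaces this loop with FFTs, which is consistent with the crossover behavior reported above.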
QR Decomposition: Benchmark Overview

In the QR benchmark, an m×n matrix A is factorized into an m×m unitary matrix Q and an upper triangular matrix R. The matrices A, Q, and R contain complex, single-precision floating-point numbers. The following properties hold after QR factorization:

A = QR;  Q^H Q = I    (2)
The HPEC reference implementation performs QR via Givens rotations. A Givens rotation selectively zeroes an element of the target matrix A by updating two of its rows. In the Fast Givens QR algorithm [3], the rotations necessary to triangularize A into R are computed directly into Q.

The parallel approach used on the GPU employs Givens rotations in the standard Sameh-Kuck concurrency strategy [4]. This pattern concurrently zeroes elements that are a knight's move apart; see Figure 1 of [4].
Seven sets of test data are parametrized in Table 4; the first three sets follow the HPEC QR benchmark.
Table 4: QR Test Parameters

Test   1    2    3    4     5     6     7
M      500  180  150  1000  1000  2000  2000
N      100  60   150  500   1000  1000  2000
Table 5: QR Test Results

Test               1     2      3      4    5    6    7
CPU time (s.)      0.79  0.062  0.087  26   65   120  209
GPU total (ratio)  1.7   0.90   0.81   2.6  2.6  2.4  2.6

CPU time and the ratio of CPU time to GPU time are reported. GPU time includes the cost of copying memory between card and host; this is less than 1% of the total time.

Performance asymptotically twice as fast as Sameh-Kuck should be attainable via pipelining [5]. However, this approach was not implemented; the graphics card provides limited thread synchronization and no atomicity², such that an effective pipelining approach would reduce the ratio of computations to memory access.

CUDA Programmability

The CUDA paradigm, in which the GPU is a SIMD processor array, makes an efficient tradeoff between general-purpose and specialized computation. A single line of code invokes a device function with a specified thread organization. The card tightly interleaves these threads’ computations and memory accesses. There is little complexity overhead: the GPU benchmarks have roughly as many lines of code as their HPEC Challenge counterparts.
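That single-line invocation can be made concrete with a minimal CUDA sketch. The kernel name, the operation, and the 256-thread block size are our assumptions for illustration, not taken from the benchmark code:

```cuda
// Illustrative only: a trivial device function and the one-line host
// call that launches it with a specified thread organization.
__global__ void scaleKernel(float *data, float alpha, int n)
{
    // Each thread handles one element of the array.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

void scaleOnDevice(float *d_data, float alpha, int n)
{
    // The <<<grid, block>>> execution configuration specifies the thread
    // organization: a 1-D grid of 256-thread blocks covering n elements.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, alpha, n);
}
```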
Data-parallel design for CUDA follows well-understood SIMD patterns. The key challenges of the architecture are structuring expensive memory access calls appropriately and avoiding complicated synchronization requirements.

The two NVIDIA-supplied libraries (CUFFT, for fast Fourier transforms, and CUBLAS, a set of basic linear algebra subroutines) were readily adapted to FIR and QR respectively. CUFFT peak performance of 35 Gflop/s was measured with direct tests, but these frequency-domain results represent at best 10 Gflop/s performance. CUBLAS is incomplete; the library implements some operations, such as Givens rotations, for real numbers but not for complex numbers. These shortcomings suggest avenues for future exploration. Nevertheless, these libraries are, like CUDA itself, tremendously convenient; they easily and effectively exploit the GPU’s parallel capabilities.

References

[1] NVIDIA Corporation, “NVIDIA CUDA Compute Unified Device Architecture Programming Guide,” Version 1.0, 23 June 2007.
[2] J. Lebak, A. Reuther, and E. Wong, “Polymorphous Computing Architecture (PCA) Kernel-Level Benchmarks,” MIT Lincoln Laboratory project report PCA-KERNEL-1, 13 June 2005.
[3] G. H. Golub and C. F. Van Loan, Matrix Computations, Third Edition, Johns Hopkins University Press, 1996.
[4] A. H. Sameh and D. J. Kuck, “On Stable Parallel Linear System Solvers,” Journal of the ACM, Vol. 25, No. 1, Jan. 1978, pp. 81-91.
[5] M. Hofmann and E. J. Kontoghiorghes, “Pipeline Givens Sequences for computing the QR decomposition on an EREW PRAM,” Parallel Computing, Vol. 32, No. 3, March 2006.

¹ In each time-domain test, total GPU time was no more than 7% greater than calculation time.
² Atomic operations on integers exist in CUDA Compute Capability 1.1 on the newer but slower GeForce 8600 and 8500 cards. Atomic operations on floating-point numbers are not supported in any existing hardware but are referenced in the high-level PTX assembler language used by the card.