F_i = \sum_{j \neq i} F_{i,j} = \sum_{j \neq i} \frac{G \, m_i \, m_j \, (x_j - x_i)}{|x_j - x_i|^3}   (2)

Where m is the mass of each body, x is the position of each body and G is a scaling factor (we use G = 0.001).
3) Synchronise each thread
4) Calculate the new velocities of each body using the
following equation:
v_{t+1,i} = v_{t,i} + \frac{F_i}{m_i} \cdot t   (3)
5) Calculate the new positions of each body using the
following equation:
x_{t+1,i} = x_{t,i} + v_{t+1,i} \cdot t   (4)
Where t is the unit time step (we use t = 1).
6) Synchronise each thread
7) Repeat from step 1, 100 times; a sketch of one timestep (steps 2 to 5) is given below.
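To make the per-timestep work concrete, the following sketch shows one way steps 2 to 5 might map onto CUDA, consistent with the 32-bit precision used in the benchmark. It is a minimal illustration rather than the code used in our experiments: the kernel and array names (compute_forces, update_bodies, pos, vel, force, mass) are hypothetical, positions are assumed to be 3D float3 values, and a small softening term is added to the denominator of equation (2) purely to avoid division by zero.

#define G_SCALE 0.001f   /* scaling factor G used in the paper */
#define DT      1.0f     /* unit time step t used in the paper */

/* Equation (2): accumulate the force on body i from every other body j. */
__global__ void compute_forces(const float3 *pos, const float *mass,
                               float3 *force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 f = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-9f;  /* softening (assumption) */
        float inv_r3 = rsqrtf(r2) / r2;                  /* 1 / |x_j - x_i|^3      */
        float s = G_SCALE * mass[i] * mass[j] * inv_r3;
        f.x += s * dx;  f.y += s * dy;  f.z += s * dz;
    }
    force[i] = f;
}

/* Equations (3) and (4): v_{t+1,i} = v_{t,i} + (F_i / m_i) t,
                          x_{t+1,i} = x_{t,i} + v_{t+1,i} t.   */
__global__ void update_bodies(float3 *pos, float3 *vel, const float3 *force,
                              const float *mass, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float inv_m = 1.0f / mass[i];
    vel[i].x += force[i].x * inv_m * DT;   /* equation (3) */
    vel[i].y += force[i].y * inv_m * DT;
    vel[i].z += force[i].z * inv_m * DT;
    pos[i].x += vel[i].x * DT;             /* equation (4) */
    pos[i].y += vel[i].y * DT;
    pos[i].z += vel[i].z * DT;
}

Splitting the force calculation and the state update into separate kernels mirrors the synchronisation points in steps 3 and 6: no position is overwritten while another thread may still be reading it.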
This task has been chosen in part because it is a common
scientific model and in part because it requires a fully
synchronised mesh process architecture. All the calculations
were performed to 32-bit precision. Figure 9 shows the
performance improvement over a reference CPU implemen-
tation.
Figure 9. Performance of GPU and FPGA architectures doing N-body
simulations
The only way to globally synchronise the threads of the GPU is to stop the kernel and restart it. For this benchmark, the performance of the GPU is on average 43.2 times greater than that of the CPU. The cost of completely synchronising the GPU has reduced its performance relative to the other benchmarks. However,
it still significantly outperforms the HC-1, which ran the benchmarks on average 1.9 times faster than the CPU. The improved GPU performance for systems of between 4800 and 9600 bodies is due to the model achieving a more efficient allocation of threads across the 240 cores.
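To illustrate where this synchronisation cost arises, a hypothetical host loop for the kernels sketched above is shown below; each kernel launch boundary on the default stream acts as the global barrier required by steps 3 and 6, so every timestep pays two launch overheads. The block size of 256 threads and the function name run_nbody are illustrative assumptions, not the configuration used in our experiments.

#include <cuda_runtime.h>

/* Hypothetical driver: global synchronisation is obtained only by letting one
   kernel finish before the next is launched, since threads in a running kernel
   cannot be barrier-synchronised across all 240 cores of the GTX285.          */
void run_nbody(float3 *d_pos, float3 *d_vel, float3 *d_force,
               float *d_mass, int n)
{
    const int threads = 256;                      /* illustrative block size */
    const int blocks  = (n + threads - 1) / threads;

    for (int step = 0; step < 100; ++step) {      /* step 7: 100 iterations  */
        compute_forces<<<blocks, threads>>>(d_pos, d_mass, d_force, n);       /* step 2 */
        /* kernels on the same stream run in order, giving the barrier of step 3 */
        update_bodies<<<blocks, threads>>>(d_pos, d_vel, d_force, d_mass, n); /* steps 4-5 */
        /* returning to the top of the loop provides the barrier of step 6 */
    }
    cudaDeviceSynchronize();                      /* wait for the final timestep */
}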
A similar comparison by Tsoi and Luk [17], using customised hardware and firmware, concluded that an FPGA-based n-body simulation can run twice as fast as a GPU. We adapted this simulation to better imitate their work (simulating 81920 bodies for one iteration in 3D) and re-ran our simulations. Our GPU simulation ran slightly faster than theirs (7.8 s versus 9.2 s), and our compiled Convey code ran much slower than their custom hardware and firmware (37.9 s versus 5.62 s). Thus, if the development of custom hardware and firmware does not significantly reduce the productivity of a simulation, FPGA-based HPCS can still outperform our GPU software (here by a factor of 1.4).
VIII. CONCLUSIONS
We have evaluated the performance of the Convey HC-1 and the Nvidia GTX285 on a range of tasks common to HPCS-based scientific research. In all cases, both platforms outperformed an equivalent CPU implementation. For most of these tasks the GPU significantly outperformed the FPGA architecture. The one exception, the generation of pseudo-random numbers, used closed-source firmware customised for both the task and the platform. We suggest that without a standardised FPGA HPCS platform around which open-source firmware could be developed, FPGA-based HPCS will be increasingly marginalised to specialist applications. Further, the only people both sufficiently equipped and capable of developing the necessary firmware for FPGA-based HPCS will be the hardware developers. Supporting this conclusion is the fact that Cray no longer sell FPGA-based supercomputers and their latest product line (the CX1) instead uses Nvidia GPUs.
IX. ACKNOWLEDGEMENTS
The authors acknowledge the support received from EPSRC on grants EP/C549481 and EP/E045472. The authors
also thank Nvidia and Convey Computer for their assistance.
REFERENCES
[1] J. Kepner, "HPC productivity: An overarching view," International Journal of High Performance Computing Architectures, vol. 18, no. 4, pp. 393, 2004.

[2] S. Craven and P. Athanas, "Examining the viability of FPGA supercomputing," EURASIP Journal on Embedded Systems, vol. 2007, no. 1, pp. 13, 2007.

[3] T. Takagi and T. Maruyama, "Accelerating HMMER search using FPGA," in International Conference on Field Programmable Logic and Applications (FPL 2009), pp. 332-337, 2009.

[4] M. Chiu and M. C. Herbordt, "Efficient particle-pair filtering for acceleration of molecular dynamics simulation," in International Conference on Field Programmable Logic and Applications (FPL 2009), 2009.

[5] N. Wolter, M. O. McCracken, A. Snavely, L. Hochstein, T. Nakamura, and V. Basili, "What's working in HPC: Investigating HPC user behavior and productivity," CTWatch Quarterly, November 2006.

[6] Open FPGA Alliance, "OpenFPGA general API specification 0.4," www.openfpga.org, 2008.

[7] Convey Computer, "The Convey HC-1: The world's first hybrid-core computer," www.conveycomputer.com, 2009.

[8] D. K. G. Campbell, "Towards the classification of algorithmic skeletons," Technical Report YCS 276, Department of Computer Science, University of York, 1996.

[9] D. M. Goodeve, "Performance of multiprocessor communications networks," PhD thesis, University of York, 1994.

[10] W. D. Smith and A. R. Schnore, "Towards an RCC-based accelerator for computational fluid dynamics applications," The Journal of Supercomputing, vol. 30, no. 3, pp. 239-261, 2004.

[11] SRC Computers, "SRC-7 MAPstation product sheet," srccomp.com.

[12] A. J. van der Steen, "Overview of recent supercomputers," phys.uu.nl, 2005.

[13] T. Brewer, "Instruction set innovations for the Convey HC-1 computer," conveycomputer.com.

[14] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator," ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 8, no. 1, pp. 3-30, 1998.

[15] X. Tian and K. Benkrid, "Mersenne twister random number generation on FPGA, CPU and GPU," in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems, 2009.

[16] Altera, "Designing and using FPGAs for double-precision floating-point math," Altera white paper, 2007.

[17] K. Tsoi and W. Luk, "Axel: A heterogeneous cluster with FPGAs and GPUs," in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 115-124, 2010.