
Scientific Programming 18 (2010) 193–201

DOI 10.3233/SPR-2010-0309
IOS Press

Acceleration of a CFD code with a GPU


Dennis C. Jespersen
NASA/Ames Research Center, Moffett Field, CA, USA
Tel.: +1 650 604 6742; Fax: +1 650 604 4377; E-mail: [email protected]

Abstract. The Computational Fluid Dynamics code OVERFLOW includes as one of its solver options an algorithm which is a
fairly small piece of code but which accounts for a significant portion of the total computational time. This paper studies some
of the issues in accelerating this piece of code by using a Graphics Processing Unit (GPU). The algorithm needs to be modified
to be suitable for a GPU and attention needs to be given to 64-bit and 32-bit arithmetic. Interestingly, the work done for the GPU
produced ideas for accelerating the CPU code and led to significant speedup on the CPU.
Keywords: GPU, CUDA, acceleration, OVERFLOW code

1. Introduction

Computational Fluid Dynamics (CFD) has a history of seeking and requiring ever higher computational performance. This quest has in the past partly been satisfied by faster CPU clock speeds. The era of increasing clock rates has reached a plateau, due mainly to heat dissipation constraints. A boost in computational performance without increasing clock speed can be supplied by parallelism. This parallelism can come in the form of task parallelism, data parallelism or perhaps a combination of the two. Common current paradigms for implementing parallelism are explicit message-passing with MPI [6] for either distributed or shared memory systems and OpenMP [7] for shared memory systems. A hybrid paradigm is also possible, using OpenMP within multiprocessor nodes and MPI for inter-node communication.

A Graphics Processing Unit (GPU) is a processor specialized for graphics rendering. The development of GPUs has been mostly driven by the demand for fast high-quality graphics images for computer games. The large market for computer games provided resources for rapid development of more powerful GPUs. The nature of the graphical display task, highly parallel rendering of many pixels, drove the development of parallel hardware with high computational capability.

GPUs were originally programmed in specialized graphics programming languages. A few researchers realized that some non-graphical computing problems could be expressed as graphical programming tasks, allowing the power of the GPU to be used in a non-graphical environment. Using a GPU in this manner was difficult, with a steep learning curve. In order to make GPU programming more accessible, several projects were initiated to develop programming languages that could exploit the power of GPUs. Some efforts in this area are Brook and BrookGPU [10] and CUDA [5]. A standardization effort is OpenCL (Open Computing Language) [9].

The enhanced accessibility of GPUs has led to much recent work in the area of general-purpose computing on GPUs. A GPU can produce a very high flop/s (floating-point operations per second) rate if an algorithm is well suited for the device. There have been several studies illustrating the acceleration of scientific computing codes that is possible by using GPUs [2,13,17]. In this paper we study the issues in accelerating a well-known CFD code, OVERFLOW, on a GPU.

The viewpoint taken here is that the GPU acts as a co-processor to the CPU. The contemporary CPU typically is a 4-core processor. Serial or modestly parallel (less than 10 threads, say) parts of a code should be executed on the CPU while massively parallel parts of a code should be executed on the GPU. The paradigm is: prepare data on the CPU, transfer to the GPU, execute on the GPU and transfer results back from the GPU to the CPU. Note that this paradigm implies that performance measures and timing comparisons between a pure CPU and a CPU + GPU combination should include the time required for data transfer to and from the GPU.

A characteristic of GPUs inherited from their use strictly as graphics rendering engines is the weak support for 64-bit arithmetic.


At the time this work was performed, 64-bit arithmetic was an order of magnitude slower than 32-bit arithmetic. This was a strong factor impelling the examination of judicious use of 32-bit arithmetic.


2. GPU environment

For our purposes here the key issues of GPUs are massive parallelism, ideally with thousands of threads, 32-bit floating-point arithmetic, and the overhead of data traffic between the CPU and GPU. For a GPU to successfully accelerate a code segment the code must be amenable to large-scale parallelism, must contain enough computational work to amortize the cost of transferring data from the CPU to the GPU and transferring results back from the GPU, and should tolerate 32-bit floating-point arithmetic. (Recent GPU hardware supports some 64-bit arithmetic but 32-bit arithmetic is significantly faster.) We will see that the SSOR algorithm in OVERFLOW is not well suited to a GPU, but that a Jacobi version of the algorithm might be suitable for a GPU.

This work used CUDA (Compute Unified Device Architecture) to support the use of the GPU. CUDA, developed by NVIDIA Corp., is a combination of hardware and software that enables several models of NVIDIA graphics cards to be used as general-purpose processors. The "device" (GPU) is physically a set of multiprocessors, say on the order of 20 multiprocessors, each of which is itself a set of cores, perhaps 8–32 cores, giving a total of a few hundred cores. Logically, the device is organized as a one- or two-dimensional array of blocks and each block is structured as a one-, two- or three-dimensional array of threads. All the threads in a specific block execute in a specific multiprocessor. All threads execute the same code, each thread operating on its own data, so this is a data-parallel programming paradigm. There is a global memory accessible to all threads. In addition, threads in a given multiprocessor can cooperate via a "shared memory" (effectively a cache). Threads in a given multiprocessor can synchronize with one another but there is no synchronization across multiprocessors. The shared memory is small and accesses to it have low latency. The global memory is large but with high latency. One can increase bandwidth to global memory by properly grouping ("coalescing") memory requests. One hopes that global memory latency can be covered by using a large number of threads and switching quickly from one thread to another if a thread is blocked waiting for data.

The actual programming language for the CUDA architecture is C with a small set of extensions. These extensions consist of keywords to describe where a function executes and where data reside. In addition each thread has predefined variables that describe its location in its batch and the location of its batch in the grid of multiprocessors. Finally, there is syntax denoting the execution of a "kernel" (function which executes on the device) and the configuration of the device for the kernel. So a CUDA program consists of C code interspersed with calls to one or more kernel functions. There is library support for device initialization, transfer of data to and from the device and other miscellaneous activities. An example of a CUDA program is given in the Appendix.

Attractive features of CUDA are the gentle learning curve, wide availability of CUDA-capable devices, and a large and active user community. A potential weakness is the tie to a single vendor, resulting in lack of portability. The OpenCL project [9] is an open standard for parallel programming of heterogeneous systems. When this work was begun the OpenCL standard had not been finalized.

A significant feature of CUDA programming, which might be seen as either a strength or a weakness, is the necessity for the programmer to specify the placement of data in global and shared memory, which effectively amounts to a requirement to explicitly manage the cache. This goes along with the necessity to explicitly define the layout of the GPU (number of blocks, number of threads) and gives CUDA programming a distinct feeling of writing at a low level, close to the hardware. The mapping of a numerical algorithm onto the GPU can often be defined in a variety of ways, each of which might give different performance, and the only way to evaluate the performance of the various alternatives is by actual coding and testing.
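The Appendix gives a complete CUDA vector-addition example. As an additional illustration of the block/thread organization, shared-memory cooperation and block-level synchronization just described, the following is a minimal sketch (not taken from OVERFLOW; the kernel name and sizes are hypothetical) in which the threads of one block cooperate through shared memory to form a partial sum:

    #include <cuda_runtime.h>

    #define THREADS 256   /* threads per block; a power of two for the reduction */

    /* Each block stages THREADS elements of x in shared memory, synchronizes,
       and reduces them to one partial sum; thread 0 of the block writes the
       result to blockSums[blockIdx.x].  Note that __syncthreads() synchronizes
       only the threads of one block, matching the restriction noted above. */
    __global__ void block_sum(const float *x, float *blockSums, int n) {
        __shared__ float cache[THREADS];          /* low-latency per-block memory */
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (gid < n) ? x[gid] : 0.0f;
        __syncthreads();                          /* all loads done before reducing */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = cache[0];
    }

Combining the per-block sums into a single number would take a second kernel launch or a copy back to the CPU; the point of the sketch is only the cooperation pattern: global memory for data visible to all threads, shared memory and __syncthreads() for cooperation within a block.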

3. OVERFLOW code

The OVERFLOW code [4,11,12,16] is intended for the solution of the Reynolds-averaged Navier–Stokes equations with complex geometries. The code uses finite differences on logically Cartesian meshes. The meshes are body-fitted and geometric complexity is handled by allowing the meshes to arbitrarily overlap one another.

OVERFLOW uses implicit time-stepping and can be run in time-accurate or steady-state mode. Implicit time-stepping is used because implicit methods tend to mitigate severe stability limits on the size of the time step that arise for explicit time-stepping methods on highly-stretched grids; such grids are common for viscous flow problems at high Reynolds numbers. A consequence of implicit time-stepping is that some method is needed to (approximately) solve the large system of equations that arises when advancing from one time level to the next.

The OVERFLOW user needs to specify physical flow inputs, such as Mach number and Reynolds number, and boundary conditions which typically define solid walls and inflow or outflow regions. Along with these physics-type inputs there are inputs which choose particular numerical algorithms and specify parameters for them.

The basic equation of fluid motion solved by OVERFLOW is of the form:

    ∂Q/∂t + L(Q) = f(Q),                                                        (1)

where Q is the vector of flow variables, L(Q) denotes all the spatial differencing terms, and f(Q) denotes terms from boundary conditions and possible source terms.

Equation (1) is discretized and written in "delta form" [1,15]:

    A(ΔQ^{n+1}) = R(Q^n),                                                       (2)

where A = A(Q^n) is a very large sparse matrix which is not explicitly constructed, ΔQ^{n+1} = Q^{n+1} - Q^n is the vector of unknowns, and R(Q^n) involves the discretization of the L(Q) and f(Q) terms at time level n. The user of OVERFLOW must choose among several possible discretizations (e.g., central differencing, Total Variation Diminishing, Roe upwind). Each of these choices typically requires further user specification of numerical parameters such as dissipation parameters or type of flux limiter and parameters for the limiter. Finally, the user needs to decide which implicit algorithm to use: some choices are approximate factored block tridiagonal, approximate factored scalar pentadiagonal or LU-SGS (approximate LU factorization with Symmetric Gauss–Seidel iteration). Over the years the code evolved and expanded to incorporate six basic choices for the implicit part of the algorithm.


4. The SSOR algorithm in OVERFLOW

In an attempt to ease the user's burdensome task of selecting algorithm options and choosing parameters which may change for each class of flow problem, recently another option was added for the implicit part of OVERFLOW, with the hope that it would be widely applicable and would be almost universally usable [14]. In the reference, this algorithm is referred to as an SSOR algorithm, but it is strictly speaking a mix of an SSOR algorithm and a Jacobi algorithm, so it might be called a "quasi-SSOR" algorithm.

The key step of the algorithm is as follows. At each grid point with index (j, k, l) one computes a residual R_{jkl} and six 5 x 5 matrices A^+, B^+, C^+, A^-, B^-, C^-; these matrices depend on the flow variables at the neighboring grid points and are fixed during the SSOR iterations. Then, with iteration stage denoted by a superscript n and with a relaxation parameter ω, relaxation steps are of the form:

    ΔQ^{n+1}_{jkl} = (1 - ω) ΔQ^n_{jkl}
                     + ω ( R_{jkl} - A^+_{jkl} ΔQ^n_{j-1,k,l} - B^+_{jkl} ΔQ^{n+1}_{j,k-1,l} - C^+_{jkl} ΔQ^{n+1}_{j,k,l-1}
                                   - A^-_{jkl} ΔQ^n_{j+1,k,l} - B^-_{jkl} ΔQ^n_{j,k+1,l} - C^-_{jkl} ΔQ^n_{j,k,l+1} )      (3)

for a forward sweep (assuming the 5-vectors ΔQ^{n+1}_{j,k-1,l} and ΔQ^{n+1}_{j,k,l-1} have been computed, and updating all ΔQ_{jkl} as soon as a full line of j values has been computed) and a step of the form:

    ΔQ^{n+1}_{jkl} = (1 - ω) ΔQ^n_{jkl}
                     + ω ( R_{jkl} - A^+_{jkl} ΔQ^n_{j-1,k,l} - B^+_{jkl} ΔQ^n_{j,k-1,l} - C^+_{jkl} ΔQ^n_{j,k,l-1}
                                   - A^-_{jkl} ΔQ^n_{j+1,k,l} - B^-_{jkl} ΔQ^{n+1}_{j,k+1,l} - C^-_{jkl} ΔQ^{n+1}_{j,k,l+1} )      (4)

for a backward sweep (again assuming ΔQ^{n+1}_{j,k+1,l} and ΔQ^{n+1}_{j,k,l+1} have been computed). The forward/backward pair is then iterated. This algorithm is not strictly speaking an SSOR algorithm; it is Jacobi in j and SSOR in k and l. We will refer to it as SSOR for simplicity. This algorithm needs the values of ΔQ at the six nearest spatial neighbors of grid point (j, k, l), some of them at iteration level n and some of them at iteration level n + 1.
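To make the sweep structure concrete, the following is a minimal sketch in C of the forward sweep (3), with scalar stand-ins for the 5 x 5 blocks and 5-vectors and a hypothetical flattened layout idx = j + nj*(k + nk*l); this is not the OVERFLOW data structure. The j loop uses only old values (Jacobi in j), and each completed j-line is committed before the next (k, l) line is processed (Gauss–Seidel in k and l):

    #define IDX(j, k, l) ((j) + nj * ((k) + nk * (l)))

    /* Forward sweep of Eq. (3), scalar analogue; boundary points are untouched. */
    void forward_sweep(int nj, int nk, int nl, double omega,
                       double *dq, const double *r,
                       const double *ap, const double *bp, const double *cp,
                       const double *am, const double *bm, const double *cm,
                       double *line)                       /* scratch array of length nj */
    {
        for (int l = 1; l < nl - 1; l++) {
            for (int k = 1; k < nk - 1; k++) {
                for (int j = 1; j < nj - 1; j++) {
                    int i = IDX(j, k, l);
                    double rhs = r[i]
                        - ap[i] * dq[IDX(j - 1, k, l)]     /* old value: Jacobi in j          */
                        - bp[i] * dq[IDX(j, k - 1, l)]     /* already updated: Gauss-Seidel k */
                        - cp[i] * dq[IDX(j, k, l - 1)]     /* already updated: Gauss-Seidel l */
                        - am[i] * dq[IDX(j + 1, k, l)]     /* old value                       */
                        - bm[i] * dq[IDX(j, k + 1, l)]     /* old value                       */
                        - cm[i] * dq[IDX(j, k, l + 1)];    /* old value                       */
                    line[j] = (1.0 - omega) * dq[i] + omega * rhs;
                }
                for (int j = 1; j < nj - 1; j++)           /* commit the completed j-line     */
                    dq[IDX(j, k, l)] = line[j];
            }
        }
    }

The reads at (j, k-1, l) and (j, k, l-1) pick up values already updated earlier in the same sweep, so the k and l loops cannot be executed in parallel; this is the data dependence that makes the algorithm, as it stands, a poor fit for a GPU.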

The SSOR algorithm is a modest-sized subroutine but testing shows it may consume 80% of the total runtime of the code, so it is a computational hot spot. The modest size of the subroutine and the large fraction of total time consumed by the algorithm make using a GPU as a co-processor to accelerate the code an attractive idea.

Unfortunately, the algorithm as it stands is not suited to a GPU due to the dependencies of the iteration, namely ΔQ^{n+1} appears on the right-hand side of Eqs (3) and (4). An algorithm that would be suited to a GPU would be a Jacobi algorithm with relaxation steps of the form:

    ΔQ^{n+1}_{jkl} = (1 - ω) ΔQ^n_{jkl}
                     + ω ( R_{jkl} - A^+_{jkl} ΔQ^n_{j-1,k,l} - B^+_{jkl} ΔQ^n_{j,k-1,l} - C^+_{jkl} ΔQ^n_{j,k,l-1}
                                   - A^-_{jkl} ΔQ^n_{j+1,k,l} - B^-_{jkl} ΔQ^n_{j,k+1,l} - C^-_{jkl} ΔQ^n_{j,k,l+1} ).      (5)

Here we could envision assigning a thread of computation to each grid point and the threads could compute independently of one another because there are no ΔQ^{n+1} terms on the right-hand side of (5).

It is important to realize that the Jacobi algorithm might be less robust or might converge slower than the original SSOR algorithm. Fully discussing this would take us too far afield, though we will show some convergence comparisons of Jacobi and SSOR.

The work presented here proceeded in several stages:

1. Implement a Jacobi algorithm on the CPU using 64-bit arithmetic; compare performance and convergence/stability of Jacobi and SSOR.
2. Implement a Jacobi algorithm on the CPU using 32-bit arithmetic; compare performance and convergence/stability of 64-bit and 32-bit Jacobi.
3. Implement a Jacobi algorithm on the GPU; compare performance of the GPU algorithm with the 32-bit CPU algorithm.


5. Implementation and results

The first stage of the work, implementing the Jacobi algorithm on the CPU using 64-bit arithmetic, was straightforward. We compare in Fig. 1 convergence for the SSOR and Jacobi algorithms on two test cases.

Fig. 1. SSOR and Jacobi convergence, 64-bit arithmetic.



The first test case is turbulent flow over a flat plate with a 121 x 41 x 81 grid. The second flow is turbulent flow in a curved duct with a 166 x 31 x 49 grid. Both cases show, unsurprisingly, that asymptotic convergence of the Jacobi algorithm is slightly slower than that of the SSOR algorithm. Both cases reach machine zero (solution converged to 64-bit accuracy). The SSOR algorithm is slightly faster in terms of wallclock seconds per time step, because the SSOR algorithm updates ΔQ as the computation proceeds whereas the Jacobi algorithm uses an extra array to store the changes to ΔQ and then sweeps through the full ΔQ array to form the new values of ΔQ.

Implementing the Jacobi algorithm in 32-bit arithmetic for the CPU was tedious but straightforward. The implementation included making 32-bit versions of all the subroutines dealing with the computation of the left-hand side matrices (about 50 subroutines) and copying, on the front-end, the flow variables and metric terms to 32-bit quantities; in all, 28 words per grid point were copied from 64-bit to 32-bit representation. In Fig. 2 we show convergence for the Jacobi algorithm in 64-bit arithmetic and in 32-bit arithmetic for the two test cases. To plotting accuracy there is no difference in convergence between the 32-bit and 64-bit Jacobi algorithms. This verifies for these cases that full 64-bit solution accuracy can be obtained with a 32-bit Jacobi algorithm.

Finally, the Jacobi algorithm was implemented on the GPU. The strategy was to compute all the matrices A^+, etc., on the CPU and transfer them to the GPU. The Jacobi algorithm itself, just one subroutine, was hand-translated into CUDA code. This strategy avoided a long error-prone translation of many Fortran subroutines into CUDA, but this strategy may be suboptimal as the matrices themselves could be computed on the GPU. We found no difference between convergence of the 32-bit Jacobi algorithm on the CPU and on the GPU, so the slight differences in details of floating-point arithmetic between the CPU and the GPU have no impact for these cases.

Now we consider performance of the code. The metric we use is wallclock seconds per step, so smaller is better. The GPU algorithm was coded in several slightly different ways, varying in the data layout on the GPU and whether or not shared memory on the GPU was used. Data shown are for the best-performing GPU variant.

Fig. 2. Jacobi convergence, 64-bit and 32-bit arithmetic.
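As an indication of what the hand translation of the Jacobi relaxation might look like, the following is a minimal sketch of a CUDA kernel for the Jacobi step (5), with scalar stand-ins for the 5 x 5 blocks and 5-vectors and the same hypothetical flattened layout idx = j + nj*(k + nk*l) used in the Section 4 sketch; the kernel name, argument list and layout are assumptions for illustration, not the OVERFLOW implementation. Each (j, k) pair is handled by one thread of a two-dimensional grid of blocks, with a loop over l, and every neighbor value is read from the old array:

    /* Illustrative sketch only; not the kernel used in OVERFLOW. */
    __global__ void jacobi_step(int nj, int nk, int nl, float omega,
                                const float *dq, float *dqnew, const float *r,
                                const float *ap, const float *bp, const float *cp,
                                const float *am, const float *bm, const float *cm)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < 1 || j >= nj - 1 || k < 1 || k >= nk - 1)
            return;                                   /* skip boundary and excess threads   */
        for (int l = 1; l < nl - 1; l++) {            /* loop over the third grid dimension */
            int idx = j + nj * (k + nk * l);
            float rhs = r[idx]
                - ap[idx] * dq[idx - 1]               /* neighbor (j-1, k, l), old value */
                - bp[idx] * dq[idx - nj]              /* neighbor (j, k-1, l), old value */
                - cp[idx] * dq[idx - nj * nk]         /* neighbor (j, k, l-1), old value */
                - am[idx] * dq[idx + 1]               /* neighbor (j+1, k, l), old value */
                - bm[idx] * dq[idx + nj]              /* neighbor (j, k+1, l), old value */
                - cm[idx] * dq[idx + nj * nk];        /* neighbor (j, k, l+1), old value */
            dqnew[idx] = (1.0f - omega) * dq[idx] + omega * rhs;
        }
    }

Because every right-hand side value comes from dq and every result goes to dqnew, no thread depends on another thread's output within the step, which is exactly the property that makes the Jacobi form GPU-friendly.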



The work here was done on two platforms. The first platform was a workstation equipped with a 2.1 GHz quad-core AMD Opteron 2352 processor. The host compiler system was the Portland Group compiler suite version 8. The GPU card was a 1.35 GHz NVIDIA GeForce 8800 GTX with 128 cores and 768 MB of global memory. The connection between CPU and GPU was a PCI Express 16X bus. The programming interface was CUDA version 1.0. This platform will be referred to as the "G machine".

The second platform was a workstation equipped with two 2.8 GHz dual-core AMD Opteron 2220 processors. The GPU card was a 1.30 GHz NVIDIA Tesla C1060 with 240 cores and 4 GB of global memory. For this machine, the source code was cross-compiled on the first machine using the Portland Group compiler. This platform will be referred to as the "T machine".

    Table 1
    Implicit solver times (lower is better)

    Algorithm               G machine          T machine
                            Plate    Duct      Plate    Duct
    SSOR CPU (s/step)       3.51     2.14      3.83     2.33
    Jacobi GPU (s/step)     1.43     0.91      1.35     0.76
    GPU/CPU ratio           0.41     0.43      0.35     0.33

    Table 2
    Total time for CPU and GPU (lower is better)

    Algorithm               G machine          T machine
                            Plate    Duct      Plate    Duct
    SSOR CPU (s/step)       6.96     4.21      7.93     4.85
    Jacobi GPU (s/step)     4.41     2.66      5.04     3.12
    GPU/CPU ratio           0.63     0.63      0.64     0.64

Tables 1 and 2 give performance data for the two test cases on the two machines. The implicit solver times in Table 1 (which include a small amount of work on the CPU as well as the actual relaxation algorithm) show a speedup on the GPU by about a factor of between 2.5 and 3. The reason the T machine times are only slightly better than the G machine times is that the times shown here include some CPU work and for some unknown reason the CPU routines involved ran faster on a single CPU of the G machine than on a single CPU of the T machine. The wallclock time for the full code, which is the quantity of ultimate interest to the code user, decreases by about 40%, as seen in Table 2. Again, a single CPU of the G machine is overall faster than a single CPU of the T machine.

    Table 3
    GPU kernel and data transfer times, s/step (lower is better)

                            8800 GTX           Tesla C1060
                            Plate    Duct      Plate    Duct
    GPU total               0.904    0.576     0.314    0.193
    GPU kernel only         0.784    0.499     0.142    0.082
    Data transfer           0.120    0.077     0.172    0.111

Table 3 gives GPU total time (kernel plus time for data transfer) and GPU kernel time for these cases. For the 8800 GTX device, the kernel which gave the best overall code performance was a kernel which mapped each grid point to a different thread on the GPU (thanks to Jonathan Cohen of NVIDIA for showing a nice way to do this) and which used some shared memory. For the Tesla device, the kernel which gave the best overall code performance was a kernel involving a two-dimensional mapping of the first two grid dimensions onto the device, a loop in the 3rd dimension and 16 threads per grid point with the 5 x 5 matrices on the CPU stored in an array of size 32. For both the Tesla and the 8800 GTX devices there are data layouts which give better performance of the GPU considered in isolation, but these layouts involve data motion on the CPU and this data motion loses more wallclock time than is gained by the faster kernel. From this table it can be seen that the time for data transfer to and from the GPU is relatively small for the 8800 GTX but is significant (more than the time for the kernel itself) for the C1060.
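The split in Table 3 between kernel time and transfer time is the kind of breakdown that CUDA event timers provide. The following is a minimal, self-contained sketch (not the instrumentation used for this paper; the kernel and array sizes are placeholders) of timing a host-to-device transfer and a kernel separately:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {          /* placeholder kernel */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t nbytes = n * sizeof(float);
        float *h = (float *)malloc(nbytes), *d;
        for (int i = 0; i < n; i++) h[i] = 1.0f;
        cudaMalloc((void **)&d, nbytes);

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);   /* CPU -> GPU transfer */
        cudaEventRecord(t1, 0);
        scale<<<(n + 255) / 256, 256>>>(d, n);              /* device computation  */
        cudaEventRecord(t2, 0);
        cudaEventSynchronize(t2);                           /* wait for GPU work   */

        float msTransfer, msKernel;
        cudaEventElapsedTime(&msTransfer, t0, t1);
        cudaEventElapsedTime(&msKernel, t1, t2);
        printf("transfer %.3f ms, kernel %.3f ms\n", msTransfer, msKernel);

        cudaFree(d); free(h);
        return 0;
    }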

6. Impact of GPU work on CPU code

These results are encouraging. It seems that the Jacobi GPU algorithm is significantly faster than the SSOR CPU algorithm, since Table 2 shows a speedup for the whole code of about 40%. This is a significant speedup considering that the only code being executed on the GPU is a small piece of the implicit side and there are no changes to the explicit side of the flow solver or to the turbulence model.

However, further reflection indicates that more performance can be gained from the CPU. Specifically, the SSOR CPU algorithm could be changed to use 32-bit arithmetic. Experience has shown that OVERFLOW is a bandwidth-limited code, so use of 32-bit arithmetic (where applicable) should significantly speed up the code due to the effective doubling of bandwidth. This motivated the implementation of a 32-bit SSOR algorithm on the CPU. Convergence for this algorithm, for the plate and duct test cases, was identical to convergence for the 64-bit SSOR algorithm. Single-core performance for the 32-bit SSOR algorithm is compared with performance for the 64-bit SSOR algorithm in Table 4. Total wallclock time is reduced by 20% by the use of 32-bit arithmetic; this is a significant speedup for such an unsophisticated code modification.

    Table 4
    SSOR performance on CPU, 64-bit and 32-bit (lower is better)

    Algorithm                  G machine          T machine
                               Plate    Duct      Plate    Duct
    SSOR-64 CPU (s/step)       6.96     4.21      7.93     4.85
    SSOR-32 CPU (s/step)       5.55     3.34      6.30     3.87
    SSOR-32/SSOR-64 ratio      0.80     0.79      0.79     0.80

Even more performance can be gained on the CPU by taking advantage of multiple CPU cores. OVERFLOW has long had OpenMP capability, thereby using multiple CPU cores, but the SSOR algorithm as originally coded in OVERFLOW was not amenable to OpenMP parallelism (as mentioned in Section 4, the original coding was Jacobi in the j index and Gauss–Seidel in the k and l indices, while the OpenMP parallelism in OVERFLOW is parallelism in l). Revising the algorithm to be Jacobi in l and Gauss–Seidel in j and k allowed the use of OpenMP for the SSOR algorithm. The revised SSOR algorithm had the same convergence characteristics as the Jacobi algorithm for the duct and plate test cases. The revised algorithm can also be coded in 64-bit or 32-bit arithmetic. We show in Table 5 performance for these algorithms and various numbers of OpenMP threads; these are all CPU performance numbers, the GPU is not involved here, and this is performance for the full code. For a single OpenMP thread the revised SSOR algorithm is slower than the original, due to poorer cache utilization, but for 2 or 4 OpenMP threads the revised algorithm is faster than the original.

    Table 5
    SSOR and OpenMP performance on CPU, s/step (lower is better)

    Algorithm          OpenMP     G machine          T machine
                       threads    Plate    Duct      Plate    Duct
    SSOR-64            1          6.96     4.21      7.93     4.85
    SSOR-64            2          5.53     3.27      6.76     4.11
    SSOR-64            4          4.60     2.80      5.94     3.60
    Revised SSOR-64    1          7.79     4.70      8.41     5.14
    Revised SSOR-64    2          4.79     2.85      4.76     2.96
    Revised SSOR-64    4          3.36     2.04      3.27     1.99
    SSOR-32            1          5.55     3.34      6.30     3.87
    SSOR-32            2          4.23     2.55      4.72     2.89
    SSOR-32            4          3.45     2.11      3.92     2.41
    Revised SSOR-32    1          6.07     3.65      6.92     4.20
    Revised SSOR-32    2          3.63     2.18      3.79     2.37
    Revised SSOR-32    4          2.38     1.47      2.50     1.52

Finally, Table 6 gives performance data for the two platforms using the Jacobi algorithm on the GPU and OpenMP parallelism on the CPU. To compare the code with and without GPU but otherwise using all the computational resources, the best numbers in Table 5 should be compared with the best numbers in Table 6. The result is that for the workstation with the 8800 GTX GPU, the best time with GPU is just a few percent faster than the best time without GPU, whereas for the workstation with the Tesla C1060 GPU, the best time with GPU is about 25% better than the best time without GPU. The ultimate reason for the better performance of the Tesla C1060 on this code is the relaxed alignment restrictions for coalesced loads as compared to the 8800 GTX.

    Table 6
    Jacobi on GPU + OpenMP on CPU, performance (lower is better)

    No. of OpenMP        8800 GTX                   Tesla C1060
    threads on CPU       Plate       Duct           Plate       Duct
                         (s/step)    (s/step)       (s/step)    (s/step)
    1                    4.39        2.66           4.58        2.85
    2                    3.10        1.78           2.92        1.83
    4                    2.32        1.42           1.90        1.18

Even this is not the end of the story, as there are further opportunities for moving computation from the CPU to the GPU. For example, the matrices can be computed in parallel, so this part of the computation, which is now executed by the CPU, can be moved to the GPU. These and other optimizations are currently under investigation.


7. Conclusions

The work presented in this paper has shown a speedup by a factor of between 2.5 and 3 for the SSOR solver in OVERFLOW, and a total wallclock time decrease of about 40%, for a GPU as compared to a single CPU. The GPU work gave ideas and motivation for accelerating the code on multi-core CPUs, so that currently the CPU + GPU code is about 25% faster than the multi-core CPU code.

This study has until now focused on obtaining improved performance with a one CPU + one GPU combination. This is the first step to enhancing OVERFLOW performance via GPUs on realistic problems. However, for almost all realistic cases, OVERFLOW is used with MPI (Message-Passing Interface) and many CPUs. The work here extends naturally to any cluster with a number of multi-core nodes, each node also containing a GPU. OVERFLOW could be used in hybrid mode, with each node corresponding to an MPI process, and each MPI process would have multiple OpenMP threads. The hyperwall-2 at NASA/Ames Research Center [8] is such a cluster and the version of OVERFLOW with GPU capability has run on the hyperwall-2 as a proof of concept.

It is worthwhile to note that the work done here has affected the OVERFLOW code. Official release version 2.1ac of OVERFLOW completely abandoned the 64-bit version of the SSOR algorithm in favor of the 32-bit version, and the revised SSOR algorithm (32-bit arithmetic only) is available as an option. The speedups due to 32-bit arithmetic were so compelling that 64-bit arithmetic is no longer even an option in these portions of the code. This may give food for thought when considering the need for 64-bit arithmetic on GPUs.


Acknowledgements

This work was partially supported by the Fundamental Aeronautics Program of NASA's Aeronautics Research Mission Directorate. NVIDIA Corporation donated the Tesla C1060 hardware.


Appendix

As an example of a CUDA program, consider the simple vector addition z = x + y where x, y and z are vectors of dimension n. In standard C, this might be written as:

    void s_add_vecs(int n, float *x,
                    float *y, float *z) {
        int i;
        for (i = 0; i < n; i++)
            z[i] = x[i] + y[i];
    }

and called via

    void serial_add_vecs() {
        // define scalar n; allocate and
        // initialize arrays x, y, z
        s_add_vecs(n, x, y, z);
    }

This could be written in CUDA as follows:

    __global__
    void p_add_vecs(int n, float *x,
                    float *y, float *z) {
        int myi = blockIdx.x*blockDim.x
                  + threadIdx.x;
        if (myi < n) z[myi] = x[myi] + y[myi];
    }

with calling code

    void parallel_add_vecs() {
        // define scalar n; allocate and
        // initialize arrays x, y, z
        // transfer arrays x, y, z
        // from CPU to GPU
        int nThreads = 256;
        // round up so that all n elements are covered
        int nBlocks = (n + nThreads - 1)/nThreads;
        p_add_vecs <<<nBlocks,nThreads>>>
                   (n, x, y, z);
    }

The arguments between the triple angle brackets define the execution configuration of the function call. The first argument is a structure specifying the layout of the grid of blocks; in this simple case an integer is silently promoted to a structure specifying a one-dimensional layout. The second argument is a structure specifying the thread layout within each block; again an integer is a shortcut for a one-dimensional layout. So if n = 100,000 there would be 391 blocks of 256 threads each. The __global__ keyword defines p_add_vecs as a function which will execute on the GPU. The location of the variables x, y, z is not specified, so by default they reside in the global memory of the GPU. The structures blockIdx, blockDim and threadIdx are automatically defined when the kernel function is called and serve to define the logical organization of the device. Given these structures, each thread computes an index myi into the global arrays, and if that index is less than the dimension of the arrays, the thread does work. With n = 100,000 there are 100,096 threads and the last 96 threads do no work, but this is much less than 1% of the total number of threads. The number 256 of threads here is just one possibility. Each particular model of NVIDIA card has defined upper limits on the grid and block dimensions and the number of threads per block. In addition, each thread requires some hardware resources, so for any given kernel there may be further restrictions. Meanwhile, each block should have "many" threads for effective parallel use of the available resources. Optimal choice of the number of threads typically requires some experimentation for each particular problem. In our CFD application, the best choice for number of threads and blocks might depend on the dimensions of the CFD spatial grid.
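For a multidimensional CFD grid the execution configuration is typically specified with dim3 structures rather than bare integers. As a hypothetical illustration (not code from OVERFLOW), the jacobi_step kernel sketched in Section 5 could be tiled over the first two grid dimensions with 16 x 16 thread blocks, the kernel looping over the third dimension:

    // Hypothetical launch helper for the jacobi_step sketch of Section 5.
    void launch_jacobi(int nj, int nk, int nl, float omega,
                       const float *dq, float *dqnew, const float *r,
                       const float *ap, const float *bp, const float *cp,
                       const float *am, const float *bm, const float *cm) {
        dim3 threadsPerBlock(16, 16);               // 256 threads per block
        dim3 numBlocks((nj + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (nk + threadsPerBlock.y - 1) / threadsPerBlock.y);
        jacobi_step<<<numBlocks, threadsPerBlock>>>(nj, nk, nl, omega, dq, dqnew,
                                                    r, ap, bp, cp, am, bm, cm);
    }

As noted above, the best block shape is hardware- and kernel-dependent, so in practice the 16 x 16 choice here would be only one of several configurations to time.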

References

[1] R. Beam and R.F. Warming, An implicit finite-difference algorithm for hyperbolic systems in conservation law form, J. Comp. Phys. 22(1) (1976), 87–110.
[2] T. Brandvik and G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, in: AIAA 46th Aerospace Sciences Meeting, Reno, NV, USA, 2008, paper no. AIAA-2008-607.
[3] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston and P. Hanrahan, Brook for GPUs: stream computing on graphics hardware, in: SIGGRAPH 2004, 2004.
[4] P.G. Buning, I.T. Chiu, S. Obayashi, Y.M. Rizk and J.L. Steger, Numerical simulation of the integrated space shuttle vehicle in ascent, in: AIAA Atmospheric Flight Mechanics Conference, 1988, paper no. 88-4359-CP.
[5] http://www.nvidia.com/object/cuda_home.html.
[6] http://www-unix.mcs.anl.gov/mpi.
[7] http://www.openmp.org.
[8] http://www.nas.nasa.gov/News/Releases/2008/06-25-08.html.
[9] http://www.khronos.org/opencl.
[10] http://graphics.stanford.edu/projects/brookgpu.
[11] W. Kandula and P.G. Buning, Implementation of LU-SGS algorithm and Roe upwinding scheme in OVERFLOW thin-layer Navier–Stokes code, in: AIAA 25th Fluid Dynamics Conference, Colorado Springs, CO, USA, 1994, paper no. AIAA-94-2357.
[12] R.L. Meakin, Automatic off-body grid generation for domains of arbitrary size, in: AIAA 15th Computational Fluid Dynamics Conference, Anaheim, CA, USA, 2001, paper no. AIAA-2001-2536.
[13] J. Michalakes and M. Vachharajani, GPU acceleration of numerical weather prediction, Parallel Processing Letters 18(4) (2008), 531–548.
[14] R.H. Nichols, R.W. Tramel and P.G. Buning, Solver and turbulence model upgrades to OVERFLOW 2 for unsteady and high-speed applications, in: AIAA 24th Applied Aerodynamics Conference, San Francisco, CA, USA, 2006, paper no. AIAA-2006-2824.
[15] T.H. Pulliam and D.S. Chaussee, A diagonalized form of an implicit approximate factorization algorithm, J. Comp. Phys. 39(2) (1981), 347–363.
[16] K.J. Renze, P.G. Buning and R.G. Rajagopalan, A comparative study of turbulence models for overset grids, in: AIAA 30th Aerospace Sciences Meeting, Reno, NV, USA, 1992, paper no. AIAA-92-0437.
[17] J.C. Thibault and I. Senocak, CUDA implementation of a Navier–Stokes solver on multi-GPU desktop platforms for incompressible flows, in: AIAA 47th Aerospace Sciences Meeting, Orlando, FL, USA, 2009, paper no. AIAA-2009-758.
