Automatic GPU-CPU Communication Management & Optimization

Table: prior approaches to GPU communication management (CUDA [16], Lee [12], JCUDA [26], PGI [24], Baskaran [3]), ranging from manual to semi-automatic.

Figure: naïve, inspector-executor, and acyclic CPU-GPU communication patterns, showing host and device activity over time (legend: useful work, communication, kernel spawn).

Manually copying complex data-types from CPU memory to GPU memory is tedious and error-prone. Listing 1 shows how a CUDA program copies an array of strings to the GPU; most of the code in the listing involves communication management and not useful computation. Furthermore, the programmer must manage buffers and manipulate pointers. Buffer management and pointer manipulation are well-known sources of bugs.

Automatic communication management avoids the difficulties of buffer management and pointer manipulation, improving program correctness and programmer efficiency. However, automatically [...] the run-time library to translate pointers.

Listing 1 (fragment): manually copying an array of strings to the GPU.
void foo(unsigned N) {
  /* Copy elements from array to the GPU */
  char *h_d_array[M];
  for(unsigned i = 0; i < M; ++i) {
    size_t size = strlen(h_h_array[i]) + 1;
    ...
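Listing 1 breaks off after its opening lines. Purely to illustrate the burden described above, and not as the paper's actual listing, the following is a minimal sketch of the remaining manual management using the CUDA runtime API. It reuses the h_h_array, h_d_array, M, and kernel names from Listings 1 and 3; the array length, the kernel stub, and the omission of error handling are assumptions.

#include <string.h>
#include <cuda_runtime.h>

#define M 128                          /* array length, assumed               */
extern char *h_h_array[M];             /* host strings, defined as in Listing 3 */

__global__ void kernel(unsigned i, char **d_array) { /* device work elided */ }

void foo(unsigned N) {
  /* Copy elements from array to the GPU */
  char *h_d_array[M];                  /* host-side array of device pointers  */
  for (unsigned i = 0; i < M; ++i) {
    size_t size = strlen(h_h_array[i]) + 1;
    cudaMalloc((void **)&h_d_array[i], size);
    cudaMemcpy(h_d_array[i], h_h_array[i], size, cudaMemcpyHostToDevice);
  }

  /* Copy the pointer array itself to the GPU */
  char **d_d_array;
  cudaMalloc((void **)&d_d_array, sizeof(h_d_array));
  cudaMemcpy(d_d_array, h_d_array, sizeof(h_d_array), cudaMemcpyHostToDevice);

  for (unsigned i = 0; i < N; ++i)
    kernel<<<30, 128>>>(i, d_d_array);

  /* Free the device buffers */
  for (unsigned i = 0; i < M; ++i)
    cudaFree(h_d_array[i]);
  cudaFree(d_d_array);
}

Every line here except the kernel launch is communication management; the run-time library of Listing 3 replaces all of it with the mapArray, unmapArray, and releaseArray calls.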
Listing 3: Listing 2 after the compiler inserts run-time functions (unoptimized CGCM).
char *h_h_array[M] = {
  "What so proudly we hailed at the twilight's last gleaming,",
  ...
};

__global__ void kernel(unsigned i, char **d_array);

void foo(unsigned N) {
  for(unsigned i = 0; i < N; ++i) {
    char **d_d_array = mapArray(h_h_array);
    kernel<<<30, 128>>>(i, d_d_array);
    unmapArray(h_h_array);
    releaseArray(h_h_array);
  }
}

Algorithm 1: Pseudo-code for map
Require: ptr is a CPU pointer
Ensure: Returns an equivalent GPU pointer
info ← greatestLTE(allocInfoMap, ptr)
if info.refCount = 0 then
  if ¬info.isGlobal then
    info.devptr ← cuMemAlloc(info.size)
  else
    info.devptr ← cuModuleGetGlobal(info.name)
  cuMemcpyHtoD(info.devptr, info.base, info.size)
info.refCount ← info.refCount + 1
return info.devptr + (ptr − info.base)

Algorithm 2: Pseudo-code for unmap
Require: ptr is a CPU pointer
Ensure: Update ptr with GPU memory
info ← greatestLTE(allocInfoMap, ptr)
if info.epoch ≠ globalEpoch ∧ ¬info.isReadOnly then
  cuMemcpyDtoH(info.base, info.devptr, info.size)
  info.epoch ← globalEpoch
Each of the primary run-time library functions has an array variant. The array variants of the run-time library functions have the same semantics as their non-array counterparts but operate on doubly indirect pointers. The array mapping function translates each CPU memory pointer in the original array into a GPU memory pointer in a new array. It then maps the new array to GPU memory. Using run-time library calls, Listing 2 can be rewritten as Listing 3.

3.3 Implementation

The map, unmap, and release functions provide the basic functionality of the run-time library. The array variations follow the same patterns as the scalar versions.

Algorithm 1 is the pseudo-code for the map function. Given a pointer to CPU memory, map returns the corresponding pointer to GPU memory. The allocInfoMap contains information about the pointer's allocation unit. If the reference count of the allocation unit is non-zero, then the allocation unit is already on the GPU. When copying heap or stack allocation units to the GPU, map dynamically allocates GPU memory, but global variables must be copied into their associated named regions. The map function calls cuModuleGetGlobal with the global variable's name to get the variable's address in GPU memory. After increasing the reference count, the function returns the equivalent pointer to GPU memory.

The map function preserves aliasing relations in GPU memory, since multiple calls to map for the same allocation unit yield pointers to a single corresponding GPU allocation unit. Aliases are common in C and C++ code, and alias analysis is undecidable. By handling pointer aliases in the run-time library, the compiler avoids static analysis, simplifying implementation and improving applicability.
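To make Algorithms 1 and 2 concrete, here is a minimal C sketch of such a run-time library built on the CUDA driver API. The AllocInfo record, the fixed-size table standing in for allocInfoMap, the module handle, and the linear-scan greatestLTE are illustrative assumptions rather than CGCM's actual data structures, and error checking is omitted; only the CUDA driver calls correspond to the calls named in the pseudo-code.

/* Sketch of a CGCM-style run-time library core (assumed data structures). */
#include <cuda.h>      /* CUDA driver API */
#include <stddef.h>

typedef struct {
  void       *base;        /* start of the CPU allocation unit               */
  size_t      size;        /* size of the allocation unit in bytes           */
  CUdeviceptr devptr;      /* corresponding GPU allocation (0 if not mapped) */
  unsigned    refCount;    /* outstanding map() calls on this unit           */
  int         isGlobal;    /* global variable vs. heap/stack allocation      */
  int         isReadOnly;  /* the GPU never writes this unit                 */
  unsigned    epoch;       /* epoch of the last copy back to the CPU         */
  const char *name;        /* symbol name, used for cuModuleGetGlobal        */
} AllocInfo;

static AllocInfo allocInfoMap[1024];  /* stand-in for the real ordered map   */
static unsigned  numAllocs;
static unsigned  globalEpoch;
static CUmodule  module;              /* GPU code module holding globals     */

/* greatestLTE: the allocation unit with the greatest base address that is
 * less than or equal to ptr (linear scan here; a tree in a real library).   */
static AllocInfo *greatestLTE(void *ptr) {
  AllocInfo *best = NULL;
  for (unsigned i = 0; i < numAllocs; ++i)
    if ((char *)allocInfoMap[i].base <= (char *)ptr &&
        (!best || (char *)allocInfoMap[i].base > (char *)best->base))
      best = &allocInfoMap[i];
  return best;
}

/* Algorithm 1: translate a CPU pointer into an equivalent GPU pointer.      */
CUdeviceptr map(void *ptr) {
  AllocInfo *info = greatestLTE(ptr);
  if (info->refCount == 0) {          /* unit not yet resident on the GPU    */
    if (!info->isGlobal)
      cuMemAlloc(&info->devptr, info->size);
    else
      cuModuleGetGlobal(&info->devptr, NULL, module, info->name);
    cuMemcpyHtoD(info->devptr, info->base, info->size);
  }
  info->refCount += 1;
  return info->devptr + (size_t)((char *)ptr - (char *)info->base);
}

/* Algorithm 2: copy the unit back to CPU memory once per epoch, unless the
 * GPU can only read it.                                                     */
void unmap(void *ptr) {
  AllocInfo *info = greatestLTE(ptr);
  if (info->epoch != globalEpoch && !info->isReadOnly) {
    cuMemcpyDtoH(info->base, info->devptr, info->size);
    info->epoch = globalEpoch;
  }
}

In the same sketch, release (Algorithm 3 below) would decrement refCount and call cuMemFree when the count reaches zero for non-global units, and a real library would also register heap, stack, and global allocation units in allocInfoMap; those parts are omitted for brevity.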
Algorithm 3: Pseudo-code for release
Require: ptr is a CPU pointer
Ensure: Release GPU resources when no longer used
info ← greatestLTE(allocInfoMap, ptr)
info.refCount ← info.refCount − 1
if info.refCount = 0 ∧ ¬info.isGlobal then
  cuMemFree(info.devptr)

Algorithm 4: Pseudo-code for map promotion
forall region ∈ Functions ∪ Loops do
  forall candidate ∈ findCandidates(region) do
    if ¬pointsToChanges(candidate, region) then
      if ¬modOrRef(candidate, region) then
        copy(above(region), candidate.map)
        copy(below(region), candidate.unmap)
        copy(below(region), candidate.release)
        deleteAll(candidate.unmap)
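As an illustration of Algorithm 4's effect (a sketch under a literal reading of the pseudo-code, not one of the paper's listings): for the loop in Listing 3, assuming the loop body neither modifies nor references h_h_array on the CPU and its points-to set does not change, map promotion copies the mapArray call above the loop, copies unmapArray and releaseArray below it, and deletes the per-iteration unmapArray, so the array crosses the CPU-GPU boundary once per loop invocation instead of once per iteration:

void foo(unsigned N) {
  mapArray(h_h_array);                        /* copied above the region    */
  for (unsigned i = 0; i < N; ++i) {
    char **d_d_array = mapArray(h_h_array);   /* refCount > 0: no copy      */
    kernel<<<30, 128>>>(i, d_d_array);
    /* per-iteration unmapArray deleted: no cyclic copy back to the CPU     */
    releaseArray(h_h_array);
  }
  unmapArray(h_h_array);                      /* copied below the region    */
  releaseArray(h_h_array);
}

The hoisted mapArray holds a reference for the whole loop, so the inner mapArray finds a non-zero reference count and skips the host-to-device copy, while the single unmapArray after the loop performs the only device-to-host transfer.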
Figure 4: Whole-program speedup over sequential CPU-only execution (log scale, 0.25x to 256x) for Inspector-Executor, Unoptimized CGCM, Optimized CGCM, and Manual parallelizations of adi, atax, bicg, correlation, covariance, doitgen, gemm, gemver, gesummv, gramschmidt, jacobi-2d-imper, lu, ludcmp, seidel, 2mm, 3mm, cfd, hotspot, kmeans, lud, nw, srad, fm, and blackscholes, plus the geomean.
Table 3. Summary of program characteristics including: program suite, limiting factor for performance, the contributions of GPU and
communication time to total execution time as a percentage, the number of applicable kernels for the CGCM, Inspector-Executor, and
Named Region management techniques, and a citation for prior manual parallelizations.
Figure 4 shows whole-program speedups for inspector-executor, unoptimized CGCM, optimized CGCM, and a manual parallelization if one exists. The figure's y-axis starts at 0.25x, although some programs have lower speedups. Table 3 shows additional details for each program. The geomean whole-program speedups over sequential CPU-only execution across all 24 applications are 0.92x for inspector-executor, 0.71x for unoptimized CGCM, and 5.36x for optimized CGCM. Taking the greater of 1.0x or the performance of each application yields geomeans of 1.53x for inspector-executor, 2.81x for unoptimized CGCM, and 7.18x for optimized CGCM.

Before optimization, most programs show substantial slowdown. The srad program has a slowdown of 4,437x and nw has a slowdown of 1,126x. By contrast, ludcmp's slowdown is only 4.54x. After optimization, most programs show performance improvements and none have worse performance. However, several fail to surpass the CPU-only sequential versions. For comparison, we simulate an idealized inspector-executor system. The inspector-executor system has an oracle for scheduling and transfers exactly one byte between CPU and GPU for each accessed allocation unit. A compiler creates the inspector from the original loop [4]. To measure performance ignoring applicability constraints, the inspector-executor simulation ignores its applicability guard. CGCM outperforms this idealized inspector-executor system. The disadvantages of sequential inspection and frequent synchronization were not overcome by transferring dramatically fewer bytes.

Figure 4 shows performance results for automatic parallelization coupled with automatic communication management. Across all 24 applications, communication optimizations never reduce performance. This is a surprising result, since the glue kernel optimization has the potential to lower performance and CGCM's implementation lacks a performance guard. Communication optimization improves the performance of five of the sixteen PolyBench programs and six of the eight other programs. For many PolyBench programs, the outermost loop executes on the GPU, so there are no loops left on the CPU for map promotion to target. Therefore, optimization improves performance for only six of the 16 PolyBench programs.

Table 3 shows the number of GPU kernels created by the DOALL parallelizer. For each DOALL candidate, CGCM automatically managed communication correctly without programmer intervention. Unlike CGCM, the parallelizer requires static alias analysis. In practice, CGCM is more applicable than the simple DOALL transformation pass.

The table also shows the applicability of named regions [12] and inspector-executor management systems [4, 14, 22]. Affine communication management [24] has the same applicability as named regions but a different implementation. The named region and inspector-executor techniques require that each of the live-ins is a distinct named allocation unit. The named regions technique also requires induction-variable based array indexes. The named region and inspector-executor systems are applicable to 66 of 67 kernels in the PolyBench applications. However, they are applicable to only 14 of 34 kernels from the more complex non-PolyBench applications. Although inspector-executor and named region based techniques have different applicability guards, they both fail to transfer memory for the same set of kernels.

Table 3 shows the GPU execution and communication time as a percent of total execution time. The contributions of CPU execution and IO are not shown. This data indicates the performance-limiting factor for each program: either GPU execution, communication, or some other factor (CPU or IO). GPU execution time dominates total execution time for 13 programs, ten from PolyBench and three from other applications. The simpler PolyBench programs are much more likely to be GPU performance bound than the other, more complex programs. GPU-bound programs would benefit from more efficient parallelizations, perhaps using the polyhedral model. Communication limits the performance of five programs, all from PolyBench. The only application where inspector-executor outperforms CGCM, gramschmidt, falls in this category.

Finally, six programs, one from PolyBench and five from elsewhere, are neither communication nor GPU performance bound. Improving the performance of these applications would require parallelizing more loops. Two of the applications that are neither GPU nor communication bound, srad and blackscholes, outperform sequential CPU-only execution. These applications have reached the limit of Amdahl's Law for the current parallelization.

The manual Rodinia parallelizations involved complex algorithmic improvements. For example, in hotspot the authors replace the original grid-based simulation with a simulation based on the pyramid method. Surprisingly, the simple automatic parallelization coupled with CGCM is competitive with expert programmers using algorithmic transformations. Table 3 explains why. Programmers tend to optimize a program's hottest loops but ignore the second- and third-tier loops, which become important once the hottest loops scale to thousands of threads. Automatic GPU parallelization substitutes quantity for quality, profiting from Amdahl's Law.

7. Related Work

Although there has been prior work on automatic parallelization and semi-automatic communication management for GPUs, these implementations have not addressed the problems of fully-automatic communication management and optimization.

CUDA-lite [25] translates low-performance, naïve CUDA functions into high-performance code by coalescing and exploiting GPU shared memory. However, the programmer must insert transfers to the GPU manually.

"C-to-CUDA for Affine Programs" [3] and "A mapping path for GPGPU" [13] automatically transform programs similar to the PolyBench programs into high-performance CUDA C using the polyhedral model. Like CUDA-lite, they require the programmer to manage memory.

"OpenMP to GPGPU" [12] proposes an automatic compiler for the source-to-source translation of OpenMP applications into CUDA C. Most programs do not have OpenMP annotations. Furthermore, these annotations are time-consuming to add and not performance portable. Their system automatically transfers named regions between CPU and GPU using two passes. The first pass copies all annotated named regions to the GPU for each GPU function, and the second cleanup pass removes all the copies that are not live-in. The two passes acting together produce a communication pattern equivalent to unoptimized CGCM communication.

JCUDA [26] uses the Java type system to automatically transfer GPU function arguments between CPU and GPU memories. JCUDA requires an annotation indicating whether each parameter is live-in, live-out, or both. Java implements multidimensional arrays as arrays of references. JCUDA uses type information to flatten these arrays to Fortran-style multidimensional arrays but does not support recursive data-types.

The PGI Fortran and C compiler [24] features a mode for semi-automatic parallelization for GPUs. Users target loops manually with a special keyword. The PGI compiler can automatically transfer named regions declared with C99's restrict keyword to the GPU and back by determining the range of affine array indices. The restrict keyword marks a pointer as not aliasing with other pointers. By contrast, CGCM is tolerant of aliasing and does not require programmer annotations. The PGI compiler cannot parallelize loops containing general pointer arithmetic, while CGCM preserves the semantics of pointer arithmetic. Unlike CGCM, the PGI compiler does not automatically optimize communication across GPU function invocations. However, programmers can use an optional annotation to promote communication out of loops. Incorrectly using this annotation will cause the program to access stale or inconsistent data.

Inspector-executor systems [18, 21] create specialized inspectors to identify precise dependence information among loop iterations. Some inspector-executor systems achieve acyclic communication when dynamic dependence information is reusable. This condition is rare in practice. Saltz et al. assume a program annotation to prevent unsound reuse [21]. Rauchwerger et al. dynamically check relevant program state to determine if dependence information is reusable [18]. The dynamic check requires expensive sequential computation for each outermost loop iteration. If the check fails, the technique defaults to cyclic communication.
8. Conclusion

CGCM is the first fully automatic system for managing and optimizing CPU-GPU communication. CPU-GPU communication is a crucial problem for manual and automatic parallelizations. Manually transferring complex data-types between CPU and GPU memories is tedious and error-prone. Cyclic communication constrains the performance of automatic GPU parallelizations. By managing and optimizing CPU-GPU communication, CGCM eases manual GPU parallelizations and improves the performance and applicability of automatic GPU parallelizations.

CGCM has two parts, a run-time library and an optimizing compiler. The run-time library's semantics allow the compiler to manage and optimize CPU-GPU communication without programmer annotations or heroic static analysis. The compiler breaks cyclic communication patterns by transferring data to the GPU early in the program and retrieving it only when necessary. CGCM outperforms inspector-executor systems on 24 programs and enables a whole-program geomean speedup of 5.36x over best sequential CPU-only execution.

Acknowledgments

We thank the entire Liberty Research Group for their support and feedback during this work. We also thank Helge Rhodin for generously contributing his PTX backend. Additionally, we thank the anonymous reviewers for their insightful comments.

This material is based on work supported by National Science Foundation Grants 0964328 and 1047879, and by United States Air Force Contract FA8650-09-C-7918. James A. Jablin is supported by a Department of Energy Office of Science Graduate Fellowship (DOE SCGF).

References

[1] ISO/IEC 9899-1999 Programming Languages – C, Second Edition, 1999.
[2] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 1991.
[3] M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code generation for affine programs. In Compiler Construction (CC), 2010.
[4] A. Basumallik and R. Eigenmann. Optimizing irregular shared-memory applications for distributed-memory systems. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[6] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23, 2004.
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. 2009.
[8] D. M. Dang, C. Christara, and K. Jackson. GPU pricing of exotic cross-currency interest rate derivatives with a foreign exchange volatility skew model. SSRN eLibrary, 2010.
[9] P. Feautrier. Some efficient solutions to the affine scheduling problem: I. One-dimensional time. International Journal of Parallel Programming (IJPP), 1992.
[10] D. R. Horn, M. Houston, and P. Hanrahan. ClawHMMER: A streaming HMMer-Search implementation. In Proceedings of the Conference on Supercomputing (SC), 2005.
[11] Khronos Group. The OpenCL Specification, September 2010.
[12] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of the Fourteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2009.
[13] A. Leung, N. Vasilache, B. Meister, M. M. Baskaran, D. Wohlford, C. Bastoul, and R. Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU), pages 51–61, 2010.
[14] S.-J. Min and R. Eigenmann. Optimizing irregular shared-memory applications for clusters. In Proceedings of the 22nd Annual International Conference on Supercomputing (SC). ACM, 2008.
[15] NVIDIA Corporation. CUDA C Best Practices Guide 3.2, 2010.
[16] NVIDIA Corporation. NVIDIA CUDA Programming Guide 3.0, February 2010.
[17] L.-N. Pouchet. PolyBench: The Polyhedral Benchmark Suite. http://www-roc.inria.fr/~pouchet/software/polybench/download.
[18] L. Rauchwerger, N. M. Amato, and D. A. Padua. A scalable method for run-time loop parallelization. International Journal of Parallel Programming (IJPP), 26:537–576, 1995.
[19] H. Rhodin. LLVM PTX Backend. http://sourceforge.net/projects/llvmptxbackend.
[20] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
[21] J. Saltz, R. Mirchandaney, and R. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40, 1991.
[22] S. D. Sharma, R. Ponnusamy, B. Moon, Y.-S. Hwang, R. Das, and J. Saltz. Run-time and compile-time support for adaptive irregular problems. In Proceedings of the Conference on Supercomputing (SC). IEEE Computer Society Press, 1994.
[23] StreamIt benchmarks. http://compiler.lcs.mit.edu/streamit.
[24] The Portland Group. PGI Fortran & C Accelerator Programming Model. White Paper, 2010.
[25] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-m. W. Hwu. CUDA-Lite: Reducing GPU programming complexity. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2008.
[26] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing. Springer-Verlag, 2009.