3D Finite Difference Computation on GPUs Using CUDA
Paulius Micikevicius
NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050
ABSTRACT
In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves throughput between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
The paper is organized as follows. Section 2 reviews the CUDA programming model and GPU architecture. CUDA implementation of the 3D stencil computation is described in Section 3. Performance results are presented in Section 4. Section 5 includes the conclusions and some directions for future work.
General Terms
Algorithms, Performance, Measurement.
Keywords
Finite Difference, GPU, CUDA, Parallel Algorithms.
1. INTRODUCTION
In this paper we describe a parallelization of the 3D finite difference computation, intended for GPUs and implemented using NVIDIA's CUDA framework. The approach utilizes thousands of threads, traversing the volume slice-by-slice as a 2D front of threads in order to maximize data reuse from shared memory. GPU performance is measured for the stencil-only computation (Equation 1), as well as for the finite difference discretization of the wave equation. The latter is the basis for the reverse time migration algorithm (RTM) [6] in seismic computing. An order-k in space stencil refers to a stencil that requires k input elements in each dimension, not counting the element at the intersection. Alternatively, one could refer to the 3D order-k stencil as a (3k + 1)-point stencil. Equation 1 below defines the stencil computation for a three-dimensional, isotropic case:

D^{out}_{x,y,z} = c_0 D_{x,y,z} + \sum_{i=1}^{k/2} c_i \left( D_{x-i,y,z} + D_{x+i,y,z} + D_{x,y-i,z} + D_{x,y+i,z} + D_{x,y,z-i} + D_{x,y,z+i} \right)   (Equation 1)
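For concreteness, a scalar reference implementation of Equation 1 might look like the following sketch; the array names, coefficient layout, and bounds handling are illustrative and not taken from the paper.

/* Naive reference implementation of Equation 1: the order-k stencil applied
   to every interior point of a dimx x dimy x dimz volume.                   */
void stencil_3d_reference(float *out, const float *in, const float *c,
                          int k, int dimx, int dimy, int dimz)
{
    const int r = k / 2;                  /* stencil radius                  */
    const int stride_y = dimx;            /* distance between rows           */
    const int stride_z = dimx * dimy;     /* distance between xy-slices      */

    for (int z = r; z < dimz - r; z++)
        for (int y = r; y < dimy - r; y++)
            for (int x = r; x < dimx - r; x++) {
                const int idx = (z * dimy + y) * dimx + x;
                float value = c[0] * in[idx];
                /* 6 neighbors per distance i: +/-x, +/-y, +/-z              */
                for (int i = 1; i <= r; i++)
                    value += c[i] * (in[idx - i]            + in[idx + i]
                                   + in[idx - i * stride_y] + in[idx + i * stride_y]
                                   + in[idx - i * stride_z] + in[idx + i * stride_z]);
                out[idx] = value;
            }
}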
2. CUDA AND GPU ARCHITECTURE

A detailed description of the architecture and programming model can be found in [5][7][2]. The Tesla 10-series GPUs (Tesla C1060/S1070) contain 30 multiprocessors; each multiprocessor contains 8 streaming processors (for a total of 240), 16K 32-bit registers, and 16 KB of shared memory. Theoretical global-memory bandwidth is 102 GB/s, and available global memory is 4 GB.
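The per-multiprocessor and per-device figures above can be confirmed at run time with a short device query; the snippet below is illustrative and not part of the original paper.

#include <stdio.h>
#include <cuda_runtime.h>

// Print the GPU resources discussed above for device 0.
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("multiprocessors:          %d\n",        prop.multiProcessorCount);
    printf("registers per block:      %d\n",        prop.regsPerBlock);
    printf("shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("global memory:            %zu bytes\n", prop.totalGlobalMem);
    return 0;
}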
3. STENCIL COMPUTATION WITH CUDA

A naive approach to computing an order-k stencil refetches all (3k + 1) input elements for every output value, leading to (3k + 1) read redundancy. Our implementation reduces redundancy by performing calculations from shared memory. Since the 16 KB of shared memory available per multiprocessor is not sufficient to store a significantly large 3D subdomain of a problem, a 2D tile is stored instead (Figure 2). Extension of the computation to the third dimension is discussed below.

Figure 2. 16x16 data tile and halos for order-8 stencil in shared memory

Threads are grouped into 2D threadblocks to match the data tiling, assigning one thread per output element. Given an order-k stencil and n x m threadblocks and output tiles, an (n + k) x (m + k) shared memory array is needed to accommodate the data as well as the four halo regions. Even though space for k^2 elements could be saved by storing the halos separately (the four (k/2) x (k/2) corners of the array are not used), the savings are not significant enough to justify the increased code complexity. Since halo elements are read by at least two threadblocks, the read redundancy of loading the data into shared memory arrays is (nm + kn + km)/(nm). For example, read redundancy is 2 for an order-8 stencil when using threadblocks configured as 16x16 threads (a 24x24 shared memory array). Increasing the threadblock and tile dimensions to 32x32 reduces redundancy to 1.5; in this case threadblocks contain 512 threads (arranged as 32x16), each thread computing two output values.

A straightforward two-pass extension to 3D (computing the contributions within each 2D slice in a first pass, then adding the contributions along the third dimension in a second pass) can be merged into a single pass. Let z be the slowest varying dimension. If we assign a thread to compute output values for a given column along z, no additional shared memory storage is needed. Threads of a given threadblock coherently traverse the volume along z, computing output for each slice. While the elements in the current slice are needed by multiple threads, elements in the slices preceding and succeeding the current z-position are used only by the thread corresponding to the element's (x, y) position (Figure 3). Thus, input elements in the current slice are stored in shared memory, while each thread stores the input it needs from the preceding and succeeding k/2 slices in local variables (which in CUDA are usually placed in registers). In the case depicted in Figure 3, the four threads would access the 32 elements in the xy-plane from shared memory, while the elements along the z-axis would be stored in the corresponding threads' registers. Once all the threads in a threadblock have written the results for the current slice, the values in the local variables are shifted, reading in a new element at distance (k/2 + 1): the k local variables and shared memory are used as a queue. Output is written exactly once, and input is read with (nm + kn + km)/(nm) redundancy due to halos. For example, the total redundancy for an order-8 stencil with 16x16 tiles is 3, compared to 6 for the two-pass approach.

Figure 3. Element re-use by a group of threads
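As a worked example of the redundancy metric, the following small host-side helper (hypothetical, not from the paper) evaluates (nm + kn + km)/(nm) for the configurations quoted above.

#include <stdio.h>

/* Read redundancy of loading an n x m output tile plus halos for an order-k
   stencil: (n*m + k*n + k*m) / (n*m).  The total redundancy of the single-pass
   stencil kernel adds 1 for the output write.                                */
static float read_redundancy(int n, int m, int k)
{
    return (float)(n*m + k*n + k*m) / (float)(n*m);
}

int main(void)
{
    /* order-8 stencil, 16x16 and 32x32 tiles: matches the values in the text */
    printf("16x16, k=8: read %.2f, total %.2f\n",
           read_redundancy(16, 16, 8), read_redundancy(16, 16, 8) + 1.f);
    printf("32x32, k=8: read %.2f, total %.2f\n",
           read_redundancy(32, 32, 8), read_redundancy(32, 32, 8) + 1.f);
    return 0;
}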
D^{t+1}_{x,y,z} = 2 D^{t}_{x,y,z} - D^{t-1}_{x,y,z} + v_{x,y,z} \left( c_0 D^{t}_{x,y,z} + \sum_{i=1}^{k/2} c_i \left( D^{t}_{x-i,y,z} + D^{t}_{x+i,y,z} + D^{t}_{x,y-i,z} + D^{t}_{x,y+i,z} + D^{t}_{x,y,z-i} + D^{t}_{x,y,z+i} \right) \right)
(Equation 2)
4. EXPERIMENTAL RESULTS
This section describes experiments with two types of kernels: stencil-only and finite difference of the wave equation. Performance was measured on Tesla S1070 servers, each containing four GPUs. The S1070 servers were connected to cluster CPU nodes running the CUDA 2.0 toolkit and driver (64-bit RHEL). Throughput in millions of output points per second (Mpoints/s) was used as the metric. Multi-GPU experiments are also included in this section, since practical working sets for the finite-difference in time domain of the wave equation exceed the 4 GB memory capacity of currently available Tesla GPUs.

The prototype kernels do not account for boundary conditions. In order to avoid out-of-bounds accesses, the order-k kernel does not compute output for the first and last k/2 slices in the slowest dimension, while the k/2 boundary slices in each of the remaining 4 directions are computed with data fetched at the usual offsets. Consequently, the 4 boundaries are incorrect, as at least one radius of the stencil extends into inappropriate data. Since the intent of this study is to determine peak performance, we chose to ignore boundary conditions. Furthermore, boundary processing varies based on application as well as implementation, making experiments with general code infeasible. While we expect performance to decrease once boundary handling is integrated, we believe the performance reported below to be a reliable indication of current Tesla GPU capabilities.
4.1 Stencil-Only Computation

For fixed volume dimensions, the decrease in throughput with increased stencil order is largely due to higher read redundancy; additional arithmetic is another contributing factor. For a fixed order in space, the increased memory footprint of larger volumes affects TLB performance (a 480x480x400 working set is 703 MB, while 800x800x800 requires 3.8 GB). While throughput in Mpoints/s varied significantly across the configurations, memory throughput in GB/s (counting both reads and writes) was much more consistent, varying between 45 and 55 GB/s.
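The quoted footprints are consistent with two single-precision arrays (input and output) of the given dimensions; the helper below is an illustrative check, not code from the paper.

#include <stdio.h>

/* Memory footprint of a stencil-only working set: two single-precision
   arrays (input and output) of dimx x dimy x dimz elements, in MiB.        */
static double footprint_mib(long dimx, long dimy, long dimz)
{
    return 2.0 * dimx * dimy * dimz * sizeof(float) / (1024.0 * 1024.0);
}

int main(void)
{
    printf("480x480x400: %.0f MB\n", footprint_mib(480, 480, 400));           /* ~703 MB */
    printf("800x800x800: %.2f GB\n", footprint_mib(800, 800, 800) / 1024.0);  /* ~3.8 GB */
    return 0;
}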
4.2 Finite Difference of the Wave-Equation
Finite difference discretization of the wave equation is a major building block for Reverse Time Migration (RTM) [6][1], a technique of great interest in seismic imaging. While RTM has been known since the 1980s, until very recently its computational cost had been too high for practical purposes. Due to advances in computer architecture and the potential for higher quality results, adoption of RTM for production seismic computing has started in the last couple of years. The stencil-only computation is easily extended to the time-domain finite difference of the wave equation, second order in time (Equation 2 above). In addition to reading data from the past two time steps, array v (inverse of velocity squared, in practice) is added to the input. Computing an output element requires (7k/2) + 4 floating point operations and 4 memory accesses, not accounting for redundancy due to halos. Therefore, ideal redundancy would be 4. CUDA source code for the 8th order in space wave-equation kernel (using 16x16 tiles and threadblocks) is listed in Appendix A.
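For reference, the per-point update of Equation 2 can be written as the following scalar sketch; array and coefficient names are illustrative, and the shared-memory CUDA version of the same update is the kernel listed in Appendix A.

/* Scalar sketch of the Equation 2 update for one interior point (x,y,z).   */
float wave_update_point(const float *d_curr,   /* D^t                       */
                        const float *d_prev,   /* D^(t-1)                   */
                        const float *vsq,      /* velocity term, array v    */
                        const float *c,        /* c_0 .. c_(k/2)            */
                        int k, int x, int y, int z, int dimx, int dimy)
{
    const int stride_y = dimx;          /* distance between rows            */
    const int stride_z = dimx * dimy;   /* distance between slices          */
    const int idx = (z * dimy + y) * dimx + x;

    /* order-k stencil: c_0 * center plus 6 neighbors per distance i         */
    float div = c[0] * d_curr[idx];
    for (int i = 1; i <= k / 2; i++) {
        div += c[i] * (d_curr[idx - i]            + d_curr[idx + i]              /* x +/- i */
                     + d_curr[idx - i * stride_y] + d_curr[idx + i * stride_y]   /* y +/- i */
                     + d_curr[idx - i * stride_z] + d_curr[idx + i * stride_z]); /* z +/- i */
    }
    /* 2nd order in time: D^(t+1) = 2 D^t - D^(t-1) + v * stencil            */
    return 2.f * d_curr[idx] - d_prev[idx] + vsq[idx] * div;
}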
Table 2. 3D FDTD (8th order in space, 2nd order in time) throughput in Mpoints/s

    data dimensions           16x16 tiles    32x32 tiles
    (dimx x dimy x dimz)
    320 x 320 x 400             2,870.7        2,783.5
    480 x 480 x 480             2,965.5        3,050.5
    544 x 544 x 544             2,786.5        3,121.6
    640 x 640 x 400             2,686.9        3,046.8
    800 x 800 x 200             2,518.3        3,196.9
Performance measurements for the 8th order in space are summarized in Table 2. Two kernel versions were implemented. The first one utilizes 16x16 threadblocks and output tiles (redundancy is 5). The second implementation uses 32x16 threadblocks to compute 32x32 output tiles (redundancy is 4.5). GPU performance is roughly an order of magnitude higher than a single 4-core Harpertown Xeon running an optimized implementation of the same computation.
Table 3. Multi-GPU 3D FDTD (8th order in space, 2nd order in time) performance and scaling
    data dimensions          1 GPU                 2 GPUs                4 GPUs
    (dimx x dimy x dimz)     Mpnts/s    scaling    Mpnts/s    scaling    Mpnts/s     scaling
    480 x 480 x 800          2,986.85   1.00       5,944.98   1.99       11,845.90   3.97
    544 x 544 x 400          2,826.35   1.00       5,545.63   1.96        6,453.15   2.28
    544 x 544 x 800          2,736.89   1.00       5,459.69   1.99       11,047.20   4.04
    640 x 640 x 640          2,487.17   1.00       5,380.89   2.16       10,298.97   4.14
    640 x 640 x 800          2,433.94   1.00       5,269.04   2.16       10,845.55   4.46
4.3 Multi-GPU Implementation

Given two GPUs and a computation of order k in space, data is partitioned by assigning each GPU half the data set plus (k/2) slices of ghost nodes (Figure 4). Each GPU updates its half of the output, receiving the updated ghost nodes from its neighbor. Data is divided along the slowest varying dimension so that contiguous memory regions are copied during ghost node exchanges. In order to maximize scaling, we overlap the exchange of ghost nodes with kernel execution. Each time step is executed in two phases, as shown in Figure 5. In the first phase, a GPU computes the region corresponding to the ghost nodes in the neighboring GPU. In the second phase, a GPU executes the compute kernel on its remaining data, at the same time exchanging the ghost nodes with its neighbor. For each CPU process controlling a GPU, the exchange involves three memory copies: GPU->CPU, CPU->CPU, and CPU->GPU. CUDA provides asynchronous kernel execution and memory copy calls, which allow both the GPU and the CPU to continue processing during the ghost node exchange.
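A sketch of one time step per MPI process under this two-phase scheme is shown below; the kernel wrapper, buffer names, and single-neighbor layout are assumptions made for illustration rather than code from the paper.

#include <cuda_runtime.h>
#include <mpi.h>

/* Assumed wrapper around the Appendix A kernel, restricted to output
   slices [z0, z1) of this process's sub-volume.                            */
void launch_fd_kernel(float *d_out, const float *d_in, const float *d_vsq,
                      int z0, int z1, cudaStream_t stream);

/* One time step on one GPU (one MPI process).  The process is assumed to
   own slices [0, nz_local) with its ghost slices stored above them.        */
void timestep(float *d_out, const float *d_in, const float *d_vsq,
              float *h_send, float *h_recv,        /* page-locked host buffers */
              int dimx, int dimy, int nz_local, int radius, int neighbor,
              cudaStream_t compute, cudaStream_t copy)
{
    size_t slice = (size_t)dimx * dimy;            /* elements per 2D slice    */
    size_t ghost = slice * radius;                 /* elements per ghost region */

    /* Phase 1: compute the slices the neighbor will need as ghost nodes.    */
    launch_fd_kernel(d_out, d_in, d_vsq, nz_local - radius, nz_local, compute);
    cudaStreamSynchronize(compute);

    /* Phase 2: compute the remaining slices while exchanging ghost nodes.   */
    launch_fd_kernel(d_out, d_in, d_vsq, radius, nz_local - radius, compute);

    /* GPU->CPU copy of the boundary slices computed in Phase 1 (a separate
       stream plus pinned memory lets it overlap the Phase 2 kernel) ...     */
    cudaMemcpyAsync(h_send, d_out + (nz_local - radius) * slice,
                    ghost * sizeof(float), cudaMemcpyDeviceToHost, copy);
    cudaStreamSynchronize(copy);
    /* ... CPU<->CPU exchange with the neighboring process ...               */
    MPI_Sendrecv(h_send, (int)ghost, MPI_FLOAT, neighbor, 0,
                 h_recv, (int)ghost, MPI_FLOAT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... and CPU->GPU copy of the received slices into the ghost region.   */
    cudaMemcpyAsync(d_out + (size_t)nz_local * slice, h_recv,
                    ghost * sizeof(float), cudaMemcpyHostToDevice, copy);

    /* Both the Phase 2 kernel and the ghost-node update must complete
       before the arrays are swapped for the next time step.                 */
    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(copy);
}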
Figure 4. Data distribution between two GPUs
Figure 5. Two phases of a time step for a 2-GPU implementation of FD

Extending the 2-GPU approach to more GPUs doubles the cost of ghost node exchange, since each GPU has to communicate with two neighbors. The increased communication cost is still effectively hidden for data sets large enough to warrant data partitioning among GPUs. Performance results for up to 4 GPUs (using 16x16 tiles and threadblocks arranged as 16x16 threads) are summarized in Table 3. As in the 2-GPU case, memory copies were optimized by partitioning the data only along the slowest-varying dimension. Measurements were collected on an Infiniband-connected cluster, where each CPU node was connected to two Tesla 10-series GPUs (one half of a Tesla S1070 1U server, which contains 4 GPUs). CPU-GPU communication was carried out via cudaMemcpyAsync calls, using page-locked memory on the CPU side. MPI was used to spawn one CPU process per GPU; CPU processes exchanged ghost nodes with MPI_Sendrecv calls.

Table 3 indicates that communication and computation in Phase 2 are effectively overlapped when using either 2 or 4 GPUs. Scaling is the speedup over a single GPU achieved by the corresponding number of GPUs. Note that only the smallest case (544x544x400) does not scale linearly with 4 GPUs. This is due to the fact that each GPU computes only 100 slices in Phase 2, which takes significantly less time than the corresponding communication. Our experiments show that the communication overhead is hidden as long as the number of slices per GPU is 200 or greater. Furthermore, we found that 2/3 of the communication time is spent in MPI_Sendrecv; this time should be further reduced by using a non-buffered version. The superlinear speedup for the larger data sets is due to decreased pressure on the TLB when a data set is partitioned among several GPUs: each GPU traverses only a fraction of the address space that a single GPU would have to access.
5. CONCLUSIONS AND FUTURE WORK

Future work includes having each GPU advance more than one time-step before communicating. This would be particularly interesting for the smaller data sets, where communication overhead is close to, or even greater than, the computation time.
6. ACKNOWLEDGMENTS
The author would like to thank Scott Morton of Hess Corporation for extensive assistance with the finite difference discretization of the wave equation.
7. REFERENCES
[1] Baysal, E., Kosloff, D. D., and Sherwood, J. W. C. 1983. Reverse-time migration. Geophysics, 48, 1514-1524.
[2] CUDA Programming Guide, Version 2.1, NVIDIA. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
[3] Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., and Yelick, K. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Austin, Texas, November 15-21, 2008). Conference on High Performance Networking and Computing. IEEE Press, Piscataway, NJ, 1-12.
[4] Kamil, S., Datta, K., Williams, S., Oliker, L., Shalf, J., and Yelick, K. 2006. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 Workshop on Memory System Performance and Correctness (San Jose, California, October 22, 2006). MSPC '06. ACM, New York, NY, 51-60.
[5] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (Mar. 2008), 39-55.
[6] McMechan, G. A. 1983. Migration by extrapolation of time-dependent boundary values. Geophys. Prosp., 31, 413-420.
[7] Nickolls, J., Buck, I., Garland, M., and Skadron, K. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (Mar. 2008), 40-53.
APPENDIX A. CUDA SOURCE CODE FOR THE WAVE-EQUATION KERNEL (8TH ORDER IN SPACE, 16x16 TILES)

// Tile and stencil configuration.
#define BDIMX  16       // tile (and threadblock) size in x
#define BDIMY  16       // tile (and threadblock) size in y
#define radius  4       // half of the stencil order (8th order in space)

__constant__ float c_coeff[radius + 1];   // stencil coefficients c_0 .. c_4

// The kernel signature, index setup, and c_coeff declaration above are assumed
// (reconstructed); the extracted listing begins at the variable declarations.
__global__ void fd3d_wave_16x16_order8( float *g_output, const float *g_input,
                                        const float *g_vsq,
                                        const int dimx, const int dimy, const int dimz )
{
    __shared__ float s_data[BDIMY + 2*radius][BDIMX + 2*radius];

    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int in_idx  = iy*dimx + ix;     // index for reading input
    int out_idx = 0;                // index for writing output
    int stride  = dimx*dimy;        // distance between 2D slices (in elements)

    float infront1, infront2, infront3, infront4;   // the next 4 slices along z
    float behind1, behind2, behind3, behind4;       // the previous 4 slices along z
    float current;                                  // slice being computed

    int tx = threadIdx.x + radius;  // thread's x-index into corresponding shared memory tile (adjusted for halos)
    int ty = threadIdx.y + radius;  // thread's y-index into corresponding shared memory tile (adjusted for halos)

    // fill the "in-front" and "behind" data
    behind3  = g_input[in_idx]; in_idx += stride;
    behind2  = g_input[in_idx]; in_idx += stride;
    behind1  = g_input[in_idx]; in_idx += stride;

    current  = g_input[in_idx]; out_idx = in_idx; in_idx += stride;

    infront1 = g_input[in_idx]; in_idx += stride;
    infront2 = g_input[in_idx]; in_idx += stride;
    infront3 = g_input[in_idx]; in_idx += stride;
    infront4 = g_input[in_idx]; in_idx += stride;

    for( int i = radius; i < dimz - radius; i++ )
    {
        //////////////////////////////////////////
        // advance the slice (move the thread-front)
        behind4  = behind3;
        behind3  = behind2;
        behind2  = behind1;
        behind1  = current;
        current  = infront1;
        infront1 = infront2;
        infront2 = infront3;
        infront3 = infront4;
        infront4 = g_input[in_idx];

        in_idx  += stride;
        out_idx += stride;
        __syncthreads();

        /////////////////////////////////////////
        // update the data slice in smem
        if( threadIdx.y < radius )  // halo above/below
        {
            s_data[threadIdx.y][tx]                  = g_input[out_idx - radius*dimx];
            s_data[threadIdx.y + BDIMY + radius][tx] = g_input[out_idx + BDIMY*dimx];
        }
        if( threadIdx.x < radius )  // halo left/right
        {
            s_data[ty][threadIdx.x]                  = g_input[out_idx - radius];
            s_data[ty][threadIdx.x + BDIMX + radius] = g_input[out_idx + BDIMX];
        }
        // update the slice in smem
        s_data[ty][tx] = current;
        __syncthreads();

        /////////////////////////////////////////
        // compute the output value
        float temp = 2.f*current - g_output[out_idx];
        float div  = c_coeff[0] * current;
        div += c_coeff[1]*( infront1 + behind1 + s_data[ty-1][tx] + s_data[ty+1][tx] + s_data[ty][tx-1] + s_data[ty][tx+1] );
        div += c_coeff[2]*( infront2 + behind2 + s_data[ty-2][tx] + s_data[ty+2][tx] + s_data[ty][tx-2] + s_data[ty][tx+2] );
        div += c_coeff[3]*( infront3 + behind3 + s_data[ty-3][tx] + s_data[ty+3][tx] + s_data[ty][tx-3] + s_data[ty][tx+3] );
        div += c_coeff[4]*( infront4 + behind4 + s_data[ty-4][tx] + s_data[ty+4][tx] + s_data[ty][tx-4] + s_data[ty][tx+4] );
        g_output[out_idx] = temp + div*g_vsq[out_idx];
    }
}