
3D Finite Difference Computation on GPUs using CUDA

Paulius Micikevicius
NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050

ABSTRACT
In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves throughput between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with GPUs by overlapping inter-GPU communication with computation.


Categories and Subject Descriptors


D.1.3 [Concurrent Programming]: Parallel Programming.

General Terms
Algorithms, Performance, Measurement.

Keywords
Finite Difference, GPU, CUDA, Parallel Algorithms.


1. INTRODUCTION
In this paper we describe a parallelization of the 3D finite difference computation, intended for GPUs and implemented using NVIDIA's CUDA framework. The approach utilizes thousands of threads, traversing the volume slice-by-slice as a 2D front of threads in order to maximize data reuse from shared memory. GPU performance is measured for the stencil-only computation (Equation 1), as well as for the finite difference discretization of the wave equation. The latter is the basis for the reverse time migration algorithm (RTM) [6] in seismic computing. An order-k in space stencil refers to a stencil that requires k input elements in each dimension, not counting the element at the intersection. Alternatively, one could refer to the 3D order-k stencil as a (3k + 1)-point stencil. Equation 1 below defines the stencil computation for a three-dimensional, isotropic case, where $D^{t}_{x,y,z}$ refers to the data point at position (x, y, z) computed at time-step t, and $c_j$ is the multiplicative coefficient applied to elements at distance j from the output point:

$$D^{t+1}_{x,y,z} = c_0\,D^{t}_{x,y,z} + \sum_{i=1}^{k/2} c_i \left( D^{t}_{x-i,y,z} + D^{t}_{x+i,y,z} + D^{t}_{x,y-i,z} + D^{t}_{x,y+i,z} + D^{t}_{x,y,z-i} + D^{t}_{x,y,z+i} \right)$$

(Equation 1)

The paper is organized as follows. Section 2 reviews the CUDA programming model and GPU architecture. CUDA implementation of the 3D stencil computation is described in Section 3. Performance results are presented in Section 4. Section 5 includes the conclusions and some directions for future work.
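As a concrete point of reference for Equation 1 above, an unoptimized CPU implementation might look like the sketch below. The array layout, names, and bounds handling are illustrative assumptions, not code from the paper.

// Naive CPU reference for the order-k stencil of Equation 1 (illustrative sketch only).
// "in" and "out" are dimx*dimy*dimz arrays with x varying fastest; c[0..k/2] holds the coefficients.
void stencil_reference(const float *in, float *out, const float *c,
                       int dimx, int dimy, int dimz, int k)
{
    int r = k / 2;
    long sx = 1, sy = dimx, sz = (long)dimx * dimy;    // strides along x, y, z
    for (int z = r; z < dimz - r; z++)
        for (int y = r; y < dimy - r; y++)
            for (int x = r; x < dimx - r; x++)
            {
                long idx = z * sz + y * sy + x * sx;
                float val = c[0] * in[idx];
                for (int i = 1; i <= r; i++)
                    val += c[i] * (in[idx - i * sx] + in[idx + i * sx]      // x neighbors
                                 + in[idx - i * sy] + in[idx + i * sy]      // y neighbors
                                 + in[idx - i * sz] + in[idx + i * sz]);    // z neighbors
                out[idx] = val;
            }
}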

2. CUDA AND GPU ARCHITECTURE


Figure 1. A high-level view of GPU architecture.

CUDA allows the programming of GPUs for parallel computation without any graphics knowledge [5][7]. As shown in Figure 1, a GPU is presented as a set of multiprocessors, each with its own stream processors and shared memory (user-managed cache). The stream processors are fully capable of executing integer and single-precision floating point arithmetic, with additional cores used for double precision. All multiprocessors have access to global device memory, which is not cached by the hardware. Memory latency is hidden by executing thousands of threads concurrently. Register and shared memory resources are partitioned among the currently executing threads.

There are two major differences between CPU and GPU threads. First, context switching between threads is essentially free: state does not have to be stored/restored because GPU resources are partitioned. Second, while CPUs execute efficiently when the number of threads per core is small (often one or two), GPUs achieve high performance when thousands of threads execute concurrently.

CUDA arranges threads into threadblocks. All threads in a threadblock can read and write any shared memory location assigned to that threadblock. Consequently, threads within a threadblock can communicate via shared memory, or use shared memory as a user-managed cache, since shared memory latency is two orders of magnitude lower than that of global memory. A barrier primitive is provided so that all threads in a threadblock can synchronize their execution. More detailed descriptions of the architecture and programming model can be found in [5][7][2].
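As a minimal illustration of the threadblock, shared-memory, and barrier concepts above (a sketch written for this summary, not code from the paper), the kernel below stages one element per thread in shared memory and, after a barrier, lets each thread read a value written by a different thread of the same threadblock:

// Minimal illustration of intra-threadblock communication through shared memory.
// Assumes a block size of 256 threads and an input length that is a multiple of 256.
__global__ void neighbor_shift(const float *in, float *out)
{
    __shared__ float s_buf[256];                  // one element per thread in the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    s_buf[threadIdx.x] = in[tid];                 // each thread stages its element
    __syncthreads();                              // barrier: the whole block has written

    int right = (threadIdx.x + 1) % blockDim.x;   // a neighboring thread's slot
    out[tid] = s_buf[right];                      // read another thread's value from shared memory
}

// launched, for example, as: neighbor_shift<<<n / 256, 256>>>(d_in, d_out);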


The Tesla 10-series GPUs (Tesla S1060/S1070) contain 30 multiprocessors; each multiprocessor contains 8 streaming processors (for a total of 240), 16K 32-bit registers, and 16 KB of shared memory. Theoretical global-memory bandwidth is 102 GB/s, and 4 GB of global memory is available.

3. EFFICIENT 3D STENCIL COMPUTATION ON TESLA GPUs

Utilizing memory bandwidth efficiently is key to algorithm implementation on GPUs, since global memory accesses are not implicitly cached by the hardware. We use memory access redundancy as a metric to assess implementation efficiency. Specifically, redundancy is the ratio between the number of elements accessed and the number of elements processed. Where applicable, we refer to read redundancy, which does not include writes in the access count. Write redundancy is defined similarly. Ideally, read and write redundancies would be 1 for the stencil-only computation, and overall redundancy would be 2.

Figure 2. 16x16 data tile and halos for order-8 stencil in shared memory

The naive approach to computing an order-k stencil refetches all (3k + 1) input elements for every output value, leading to (3k + 1) read redundancy. Our implementation reduces redundancy by performing calculations from shared memory. Since the 16 KB of shared memory available per multiprocessor is not sufficient to store a significantly large 3D subdomain of a problem, a 2D tile is stored instead (Figure 2). Extension of the computation to the third dimension is discussed in the next section. Threads are grouped into 2D threadblocks to match data tiling, assigning one thread per output element. Given an order-k stencil and n×m threadblocks and output tiles, an (n + k)×(m + k) shared memory array is needed to accommodate the data as well as the four halo regions. Even though space for k² elements could be saved by storing the halos separately (the four (k/2)×(k/2) corners of the array are not used), the savings are not significant enough to justify the increased code complexity. Since halo elements are read by at least two threadblocks, the read redundancy of loading the data into shared memory arrays is (nm + kn + km)/(nm). For example, read redundancy is 2 for an order-8 stencil when using threadblocks configured as 16x16 threads (24x24 shared memory array). Increasing threadblock and tile dimensions to 32x32 reduces redundancy to 1.5; in this case threadblocks contain 512 threads (arranged as 32x16), each thread computing two output values.

3.1 Extending the Computation to 3D

Once a 2D tile and halos are loaded into shared memory, each threadblock straightforwardly computes the 2D stencil for its output tile. All the data is fetched from shared memory, the latency of which is two orders of magnitude lower than that of global memory. There are two approaches to extend the computation to three dimensions: two-pass and single-pass. These are similar to stencil approaches described for Cell processors in [3][4].

As the name suggests, the two-pass approach traverses the input volume twice. During the first pass only the 2D stencil values are computed. The 1D stencil (in the remaining dimension) is computed during the second pass, combining it with the partial results from the first pass. Consequently, read redundancy is 3 plus the redundancy due to halos: the first pass reads tiles and halos into shared memory, the second pass reads the original input and the partial results. Returning to the example of an order-8 stencil and 16x16 tiles, read redundancy is 4. Since each pass writes output, write redundancy is 2. Overall redundancy is 6, which is an improvement over the naive approach, which has a redundancy of 26 (counting reads and writes). Efficiency can be further increased with the single-pass approach.

Figure 3. Element re-use by a group of threads

The two passes described above can be merged into a single one. Let z be the slowest varying dimension. If we assign a thread to compute output values for a given column along z, no additional shared memory storage is needed. Threads of a given threadblock coherently traverse the volume along z, computing output for each slice. While the elements in the current slice are needed for computation by multiple threads, elements in the slices preceding and succeeding the current z-position are used only by the threads corresponding to the elements' (x, y) position (Figure 3). Thus, input elements in the current slice are stored in shared memory, while each thread stores the input from the preceding/succeeding k/2 slices it needs in local variables (which in CUDA are usually placed in registers). Using the case depicted in Figure 3, the four threads would access the 32 elements in the xy-plane from shared memory, while the elements along the z axis would be stored in the corresponding threads' registers. Once all the threads in a threadblock write the results for the current slice, values in the local variables are shifted, reading in a new element at distance (k/2 + 1): the k local variables and shared memory are used as a queue. Output is written exactly once; input is read with (nm + kn + km)/(nm) redundancy due to halos. For example, redundancy for an order-8 stencil with 16x16 tiles is 3, compared to 6 for the two-pass approach.
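To make the redundancy arithmetic above concrete, the small host-side helper below (a sketch written for this summary, not code from the paper) reproduces the figures quoted in this section for the naive, two-pass, and single-pass variants:

#include <stdio.h>

// Memory-access redundancy (accesses per output element) for an order-k stencil
// computed with n x m output tiles, following the counting used in this section.
void print_redundancy(int n, int m, int k)
{
    double halo        = (double)(n * m + k * n + k * m) / (n * m); // tile + 4 halos
    double naive       = (3 * k + 1) + 1;       // re-fetch every input, plus 1 write
    double two_pass    = (halo + 2.0) + 2.0;    // pass 1 reads tile + halos; pass 2 reads
                                                // input + partial results; each pass writes once
    double single_pass = halo + 1.0;            // input read once (plus halos), output written once
    printf("order-%d, %dx%d tile: naive=%.1f  two-pass=%.1f  single-pass=%.1f\n",
           k, n, m, naive, two_pass, single_pass);
}

int main(void)
{
    print_redundancy(16, 16, 8);   // prints naive=26, two-pass=6, single-pass=3
    print_redundancy(32, 32, 8);   // halo read redundancy drops to 1.5
    return 0;
}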

$$D^{t+1}_{x,y,z} = 2\,D^{t}_{x,y,z} - D^{t-1}_{x,y,z} + v_{x,y,z} \left[ c_0\,D^{t}_{x,y,z} + \sum_{i=1}^{k/2} c_i \left( D^{t}_{x-i,y,z} + D^{t}_{x+i,y,z} + D^{t}_{x,y-i,z} + D^{t}_{x,y+i,z} + D^{t}_{x,y,z-i} + D^{t}_{x,y,z+i} \right) \right]$$

(Equation 2)

4. EXPERIMENTAL RESULTS
This section describes experiments with two types of kernels: stencil-only and finite difference of the wave equation. Performance was measured on Tesla S1070 servers, containing four GPUs each. The S1070 servers were connected to cluster CPU nodes running the CUDA 2.0 toolkit and driver (64-bit RHEL). Throughput in millions of output points per second (Mpoints/s) was used as the metric. Multi-GPU experiments are also included in this section, since practical working sets for the finite difference in time domain of the wave equation exceed the 4 GB memory capacity of currently available Tesla GPUs.

The prototype kernels do not account for boundary conditions. In order to avoid out-of-bounds accesses, the order-k kernel does not compute output for the first and last k/2 slices in the slowest dimension, while the k/2 boundary slices in each of the remaining 4 directions are computed with data fetched at the usual offsets. Consequently, the 4 boundaries are incorrect, as at least one radius of the stencil extends into inappropriate data. Since the intent of this study is to determine peak performance, we chose to ignore boundary conditions. Furthermore, boundary processing varies based on application as well as implementation, making experiments with general code not feasible. While we expect the performance to decrease once boundary handling is integrated, we believe the performance reported below to be a reliable indication of current Tesla GPU capabilities.


4.2 3D Finite Difference of the Wave Equation

Finite difference discretization of the wave equation is a major building block for Reverse Time Migration (RTM) [6][1], a technique of great interest in seismic imaging. While RTM has been known since the 1980s, until very recently its computational cost has been too high for practical purposes. Due to advances in computer architecture and the potential for higher quality results, adoption of RTM for production seismic computing has started in the last couple of years. The stencil-only computation is easily extended to the time-domain finite difference of the wave equation, second order in time (Equation 2 above). In addition to reading data from the past two time steps, array v (inverse of velocity squared, in practice) is added to the input. Computing an output element requires (7k/2) + 4 floating point operations and 4 memory accesses, not accounting for redundancy due to halos. Therefore, ideal redundancy would be 4. CUDA source code for the 8th order in space wave equation kernel (using 16x16 tiles and threadblocks) is listed in Appendix A.

Table 2. 3D FDTD (8th order in space, 2nd order in time) throughput in Mpoints/s
Data dimensions          Tile dimensions
(dimx x dimy x dimz)     16x16       32x32
320 x 320 x 400          2,870.7     2,783.5
480 x 480 x 480          2,965.5     3,050.5
544 x 544 x 544          2,786.5     3,121.6
640 x 640 x 400          2,686.9     3,046.8
800 x 800 x 200          2,518.3     3,196.9

4.1 Stencil-only Computation


Table 1 summarizes performance of the stencil-only kernel for varying orders in space as well as varying input volume sizes. All configurations were processed by 16x16 threadblocks, operating on 16x16 output tiles. Since tile size was fixed, the halos for the higher orders consumed a larger percentage of read bandwidth. For example, for the 8th order in space, halos (four 4x16 regions) consume as much bandwidth as the tile itself, leading to 2x read redundancy. For the 12th order in space, read redundancy is 2.5x. Using 32x32 tiles would reduce redundancy to 1.5x and 1.75x, respectively.

Table 1. 3D stencil-only throughput in Mpoints/s for various orders in space
Dimensions       Order 6   Order 8   Order 10   Order 12
480x480x400      4,455     4,269     3,435      2,885
544x544x400      4,389     4,214     3,347      2,816
640x640x400      4,168     3,932     3,331      2,802
800x800x400      3,807     3,717     3,236      2,752
800x800x800      3,780     3,656     3,247      2,302

Performance measurements for the 8th order in space are summarized in Table 2. Two kernel versions were implemented. The first one utilizes 16x16 threadblocks and output tiles (redundancy is 5). The second implementation uses 32x16 threadblocks to compute 32x32 output tiles (redundancy is 4.5). GPU performance is roughly an order of magnitude higher than a single 4-core Harpertown Xeon, running an optimized implementation of the same computation.

4.3 Multi-GPU Implementation of the Wave-Equation Finite Difference


The working set for practical finite difference computations sometimes exceeds the 4 GB of memory available on a single GPU. For example, processing a single shot (a unit of computation for RTM) can easily require more than 10 GB of storage. It is therefore necessary to partition the data among several GPUs, at the same time scaling the performance. For simplicity we will examine a 2-GPU case, which is straightforwardly extended to more GPUs.

For fixed volume dimensions, the throughput decrease with increased orders is largely due to higher read redundancy, additional arithmetic being another contributing factor. For a fixed order in space, the increased memory footprint of larger volumes affects TLB performance (the 480x480x400 working set is 703 MB, while 800x800x800 requires 3.8 GB). While throughput in Mpoints/s varied significantly across the configurations, memory throughput in GB/s (counting both reads and writes) was much more consistent, varying between 45 and 55 GB/s.

Table 3. Multi-GPU 3D FDTD (8th order in space, 2nd order in time) performance and scaling
Data dimensions      1 GPU                2 GPUs               4 GPUs
(dimx dimy dimz)     Mpnts/s   scaling    Mpnts/s   scaling    Mpnts/s    scaling
480  480  800        2,986.85  1.00       5,944.98  1.99       11,845.90  3.97
544  544  400        2,826.35  1.00       5,545.63  1.96       6,453.15   2.28
544  544  800        2,736.89  1.00       5,459.69  1.99       11,047.20  4.04
640  640  640        2,487.17  1.00       5,380.89  2.16       10,298.97  4.14
640  640  800        2,433.94  1.00       5,269.04  2.16       10,845.55  4.46

Given two GPUs and a computation of order k in space, data is partitioned by assigning each GPU half the data set plus (k/2) slices of ghost nodes (Figure 4). Each GPU updates its half of the output, receiving the updated ghost nodes from the neighbor. Data is divided along the slowest varying dimension so that contiguous memory regions are copied during ghost node exchanges. In order to maximize scaling, we overlap the exchange of ghost nodes with kernel execution. Each time step is executed in two phases, as shown in Figure 5. In the first phase, a GPU computes the region corresponding to the ghost nodes in the neighboring GPU. In the second phase, a GPU executes the compute kernel on its remaining data, at the same time exchanging the ghost nodes with its neighbor. For each CPU process controlling a GPU, the exchange involves three memory copies: GPU->CPU, CPU->CPU, and CPU->GPU. CUDA provides asynchronous kernel execution and memory copy calls, which allow both the GPU and the CPU to continue processing during ghost node exchange.

Figure 4. Data distribution between two GPUs

Extending the 2-GPU approach to more GPUs doubles the cost of ghost node exchange (each GPU has to communicate with two neighbors). The increased communication cost is still effectively hidden for data sets large enough to warrant data partitioning among GPUs. Performance results (using 16x16 tiles and threadblocks arranged as 16x16 threads) for up to 4 GPUs are summarized in Table 3. As in the 2-GPU case, memory copies were optimized by partitioning data only along the slowest-varying dimension. Measurements were collected on an Infiniband-connected cluster, where each CPU node was connected to two Tesla 10-series GPUs (one half of a Tesla S1070 1U server, containing 4 GPUs). CPU-GPU communication was carried out via cudaMemcpyAsync calls, using page-locked memory on the CPU side. MPI was used to spawn one CPU process per GPU. CPU processes exchanged ghost nodes with MPI_Sendrecv calls.

Table 3 indicates that communication and computation in Phase 2 are effectively overlapped when using either 2 or 4 GPUs. Scaling is the speedup over a single GPU achieved by the corresponding number of GPUs. Note that only the smallest case (544x544x400) does not scale linearly with 4 GPUs. This is due to the fact that each GPU computes only 100 slices in Phase 2, which takes significantly less time than the corresponding communication. Our experiments show that communication overhead is hidden as long as the number of slices per GPU is 200 or greater. Furthermore, we found that 2/3 of the communication time is spent in MPI_Sendrecv, the time for which should be further reduced by using the non-buffered version. The superlinear speedup for the larger data sets is due to the decreased pressure on the TLB when a data set is partitioned among several GPUs: each GPU traverses only a fraction of the address space that a single GPU would have to access.

Figure 5. Two phases of a time step for a 2-GPU implementation of FD
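The following sketch illustrates the two-phase time step of Figure 5 from the perspective of one GPU (one MPI process). It is not the paper's code: the kernel variant taking an explicit z-slice range, the buffer names, and the slice bookkeeping are assumptions made for illustration, and error checking is omitted.

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel variant that updates only z-slices [z_first, z_last);
// assumed to follow the structure of the kernel in Appendix A (body omitted here).
__global__ void fd_kernel(float *g_output, float *g_input, float *g_vsq,
                          int dimx, int dimy, int z_first, int z_last)
{
    /* stencil update over [z_first, z_last), omitted; see Appendix A */
}

// One time step on one GPU: Phase 1 computes the slices the neighbor needs as
// ghost nodes; Phase 2 computes the remaining slices while the freshly computed
// ghost slices are exchanged with the neighboring process.
void time_step(float *d_out, float *d_in, float *d_vsq,
               float *h_send, float *h_recv,        // page-locked host buffers
               int dimx, int dimy, int local_z, int radius, int neighbor,
               cudaStream_t compute_stream, cudaStream_t copy_stream)
{
    size_t ghost_elems = (size_t)dimx * dimy * radius;
    dim3 block(16, 16), grid(dimx / 16, dimy / 16);

    // Phase 1: update the radius slices adjacent to the inter-GPU boundary.
    fd_kernel<<<grid, block, 0, compute_stream>>>(d_out, d_in, d_vsq, dimx, dimy,
                                                  local_z - radius, local_z);
    cudaStreamSynchronize(compute_stream);

    // Phase 2a: update the remaining interior slices asynchronously.
    fd_kernel<<<grid, block, 0, compute_stream>>>(d_out, d_in, d_vsq, dimx, dimy,
                                                  radius, local_z - radius);

    // Phase 2b (overlapped with 2a): GPU->CPU copy, MPI exchange, CPU->GPU copy.
    cudaMemcpyAsync(h_send, d_out + (size_t)(local_z - radius) * dimx * dimy,
                    ghost_elems * sizeof(float), cudaMemcpyDeviceToHost, copy_stream);
    cudaStreamSynchronize(copy_stream);

    MPI_Sendrecv(h_send, (int)ghost_elems, MPI_FLOAT, neighbor, 0,
                 h_recv, (int)ghost_elems, MPI_FLOAT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaMemcpyAsync(d_out + (size_t)local_z * dimx * dimy, h_recv,
                    ghost_elems * sizeof(float), cudaMemcpyHostToDevice, copy_stream);

    cudaDeviceSynchronize();    // both phases and copies complete before the next step
}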

5. CONCLUSIONS AND FUTURE WORK


We have described a GPU parallelization approach for 3D finite difference stencil computation that achieves approximately an order of magnitude speedup over similar seismic-industry standard codes. We also described an approach for utilizing multiple GPUs to solve the problem, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation. This allows GPU processing of the large data sets common in practice, often exceeding 10 GB in size. There are at least two directions for further work. One, it would be interesting to modify the parallelization to carry out several time-steps in shared memory for the smaller stencils. Two, multi-GPU parallelization could also benefit from storing larger ghost regions and computing more than one time-step before communicating. This would be particularly interesting for the smaller data sets, where communication overhead is close to, or even greater than, the computation time.


6. ACKNOWLEDGMENTS
The author would like to thank Scott Morton of Hess Corporation for extensive assistance with the finite difference discretization of the wave equation.


7. REFERENCES
[1] Baysal, E., Kosloff, D. D., and Sherwood, J. W. C. 1983. Reverse-time migration. Geophysics, 48, 1514-1524.
[2] CUDA Programming Guide, version 2.1, NVIDIA. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
[3] Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., and Yelick, K. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Austin, Texas, November 15-21, 2008). Conference on High Performance Networking and Computing. IEEE Press, Piscataway, NJ, 1-12.
[4] Kamil, S., Datta, K., Williams, S., Oliker, L., Shalf, J., and Yelick, K. 2006. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 Workshop on Memory System Performance and Correctness (San Jose, California, October 22, 2006). MSPC '06. ACM, New York, NY, 51-60.
[5] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (Mar. 2008), 39-55.
[6] McMechan, G. A. 1983. Migration by extrapolation of time-dependent boundary values. Geophys. Prosp., 31, 413-420.
[7] Nickolls, J., Buck, I., Garland, M., and Skadron, K. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (Mar. 2008), 40-53.

Appendix A: CUDA Source Code for a 25-Point Stencil


__global__ void fwd_3D_16x16_order8(TYPE *g_output, TYPE *g_input, TYPE *g_vsq,   // g_output initially contains the (t-2) step
                                    const int dimx, const int dimy, const int dimz)
{
    // TYPE is assumed to be a typedef/macro for float, and c_coeff[] is assumed to be
    // declared before this kernel (e.g. as a __constant__ float array holding the 5
    // stencil coefficients); neither declaration is shown in the paper.
    #define BDIMX 16        // tile (and threadblock) size in x
    #define BDIMY 16        // tile (and threadblock) size in y
    #define radius 4        // half of the order in space (k/2)

    __shared__ float s_data[BDIMY + 2*radius][BDIMX + 2*radius];

    int ix      = blockIdx.x*blockDim.x + threadIdx.x;
    int iy      = blockIdx.y*blockDim.y + threadIdx.y;
    int in_idx  = iy*dimx + ix;     // index for reading input
    int out_idx = 0;                // index for writing output
    int stride  = dimx*dimy;        // distance between 2D slices (in elements)

    float infront1, infront2, infront3, infront4;   // variables for input in front of the current slice
    float behind1, behind2, behind3, behind4;       // variables for input behind the current slice
    float current;                                  // input value in the current slice

    int tx = threadIdx.x + radius;  // thread's x-index into the corresponding shared memory tile (adjusted for halos)
    int ty = threadIdx.y + radius;  // thread's y-index into the corresponding shared memory tile (adjusted for halos)

    // fill the "in-front" and "behind" data
    behind3  = g_input[in_idx];  in_idx += stride;
    behind2  = g_input[in_idx];  in_idx += stride;
    behind1  = g_input[in_idx];  in_idx += stride;
    current  = g_input[in_idx];  out_idx = in_idx;  in_idx += stride;
    infront1 = g_input[in_idx];  in_idx += stride;
    infront2 = g_input[in_idx];  in_idx += stride;
    infront3 = g_input[in_idx];  in_idx += stride;
    infront4 = g_input[in_idx];  in_idx += stride;

    for(int i = radius; i < dimz - radius; i++)
    {
        //////////////////////////////////////////
        // advance the slice (move the thread-front)
        behind4  = behind3;
        behind3  = behind2;
        behind2  = behind1;
        behind1  = current;
        current  = infront1;
        infront1 = infront2;
        infront2 = infront3;
        infront3 = infront4;
        infront4 = g_input[in_idx];

        in_idx  += stride;
        out_idx += stride;
        __syncthreads();

        /////////////////////////////////////////
        // update the data slice in smem
        if(threadIdx.y < radius)    // halo above/below
        {
            s_data[threadIdx.y][tx]                  = g_input[out_idx - radius*dimx];
            s_data[threadIdx.y + BDIMY + radius][tx] = g_input[out_idx + BDIMY*dimx];
        }
        if(threadIdx.x < radius)    // halo left/right
        {
            s_data[ty][threadIdx.x]                  = g_input[out_idx - radius];
            s_data[ty][threadIdx.x + BDIMX + radius] = g_input[out_idx + BDIMX];
        }
        // update the slice in smem
        s_data[ty][tx] = current;
        __syncthreads();

        /////////////////////////////////////////
        // compute the output value
        float temp = 2.f*current - g_output[out_idx];
        float div  = c_coeff[0]*current;
        div += c_coeff[1]*( infront1 + behind1 + s_data[ty-1][tx] + s_data[ty+1][tx] + s_data[ty][tx-1] + s_data[ty][tx+1] );
        div += c_coeff[2]*( infront2 + behind2 + s_data[ty-2][tx] + s_data[ty+2][tx] + s_data[ty][tx-2] + s_data[ty][tx+2] );
        div += c_coeff[3]*( infront3 + behind3 + s_data[ty-3][tx] + s_data[ty+3][tx] + s_data[ty][tx-3] + s_data[ty][tx+3] );
        div += c_coeff[4]*( infront4 + behind4 + s_data[ty-4][tx] + s_data[ty+4][tx] + s_data[ty][tx-4] + s_data[ty][tx+4] );
        g_output[out_idx] = temp + div*g_vsq[out_idx];
    }
}
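For completeness, a minimal host-side driver for the kernel above might look like the sketch below. It is not part of the paper: the grid configuration, the coefficient upload, and the pointer-swapping loop are illustrative assumptions. It assumes TYPE is float, that c_coeff is the __constant__ float array declared alongside the kernel, and that g_output initially holds the older wavefield, which each launch overwrites with the new time step.

#include <cuda_runtime.h>

// Hypothetical driver: advances the wavefield nsteps times by swapping the two
// wavefield arrays between launches. Assumes dimx and dimy are multiples of 16.
void run_fd(float *d_older, float *d_current, float *d_vsq,
            int dimx, int dimy, int dimz, int nsteps)
{
    dim3 block(16, 16);                 // matches BDIMX x BDIMY in the kernel
    dim3 grid(dimx / 16, dimy / 16);    // one thread per output element of a slice

    // The coefficients of the chosen 8th-order discretization (already scaled by
    // dt^2/h^2) would be filled in here; zeros are placeholders only.
    float h_coeff[5] = { 0.f, 0.f, 0.f, 0.f, 0.f };
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));

    for (int step = 0; step < nsteps; step++)
    {
        // d_older holds the previous time step on entry and receives the new one,
        // so the two arrays alternate roles from step to step.
        fwd_3D_16x16_order8<<<grid, block>>>(d_older, d_current, d_vsq,
                                             dimx, dimy, dimz);
        float *tmp = d_older; d_older = d_current; d_current = tmp;
    }
    cudaDeviceSynchronize();
}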
