CUDA Tricks

Presented by Damodaran Ramani
Synopsis

 Scan Algorithm

 Applications

 Specialized Libraries

 CUDPP: CUDA Data Parallel Primitives Library

 Thrust: a Template Library for CUDA Applications

 CUDA FFT and BLAS libraries for the GPU

References

 Scan Primitives for GPU Computing – Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens

 Presentation on scan primitives by Gary J. Katz, based on the article above

 Parallel Prefix Sum (Scan) with CUDA – Harris, Sengupta and Owens (GPU Gems 3, Chapter 39)
Introduction

 GPUs are massively parallel processors

 Programmable parts of the graphics pipeline operate on primitives (vertices, fragments)

 These primitive programs spawn a thread for each primitive to keep the parallel processors full

 Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)

 A fragment program operating on n fragments makes O(n) memory accesses

 Problems arise when the access requirements are complex (e.g. prefix-sum – O(n²))
Prefix-Sum Example
 in:  3 1 7 0 4 1 6 3
 out: 0 3 4 11 11 15 16 22
Trivial Sequential Implementation

// Exclusive scan: out[i] holds the sum of in[0..i-1]
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}
Scan: An Efficient Parallel Primitive

 Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs

 Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)
Threads & Blocks

 GeForce 8800 GTX (16 multiprocessors, 8 processors each)

 CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads

 Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU

 Within a thread block, threads can communicate through shared memory and cooperate through synchronization

 Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks (complex programming, but large performance benefits)
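
A minimal launch sketch of this model (my_kernel, d_data, and n are hypothetical, not from the slides):

// Illustrative only: the programmer picks the number of blocks and threads
// per block; the hardware maps each block onto a multiprocessor.
int threadsPerBlock = 256;   // at most 512 threads per block on a GeForce 8800
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
my_kernel<<<numBlocks, threadsPerBlock>>>(d_data, n);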
The Scan Operator

 Definition: the scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements

 [a0, a1, …, an-1]

 and returns the array

 [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]

 Types – inclusive, exclusive, forward, backward
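For comparison with the exclusive scan shown earlier, a minimal sequential sketch of the inclusive variant (the function name is mine, not from the slides):

// Inclusive scan: out[i] = in[0] ⊕ … ⊕ in[i]; unlike the exclusive
// variant, each output includes its own input element.
void inclusive_scan(const int* in, int* out, int n)
{
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i-1] + in[i];
}
// For in = {3,1,7,0,4,1,6,3}: out = {3,4,11,11,15,16,22,25}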
Parallel Scan

for (d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]
    swap(in, out)

 Complexity: O(n log2 n) additions
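
A single-block CUDA sketch of this naive scan, double-buffered in shared memory (kernel name and launch configuration are illustrative; assumes n is a power of two and at most the block size):

__global__ void naive_inclusive_scan(const float* g_in, float* g_out, int n)
{
    extern __shared__ float temp[];  // 2*n floats: two ping-pong buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout*n + tid] = g_in[tid];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;             // swap ping-pong buffers each pass
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout*n + tid] = temp[pin*n + tid - offset] + temp[pin*n + tid];
        else
            temp[pout*n + tid] = temp[pin*n + tid];
        __syncthreads();
    }
    g_out[tid] = temp[pout*n + tid];
}

// Launch (illustrative): naive_inclusive_scan<<<1, n, 2*n*sizeof(float)>>>(d_in, d_out, n);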
A Work-Efficient Parallel Scan

 Goal is a parallel scan that is O(n) instead of O(n log2 n)

 Solution:
 Balanced trees: build a binary tree on the input data and sweep it to and from the root

 A binary tree with n leaves has log2 n levels, and each level d has 2^d nodes

 One add is performed per node, therefore O(n) adds on a single traversal of the tree
O(n) Unsegmented Scan

 Reduce/Up-Sweep

for (d = 0; d < log2(n); d++)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

 Down-Sweep

x[n-1] = 0;
for (d = log2(n) - 1; d >= 0; d--)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
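
A single-block CUDA sketch of this work-efficient scan in shared memory, along the lines of GPU Gems Chapter 39 (names and launch configuration are illustrative; assumes n is a power of two that fits in one block, one thread per pair of elements):

__global__ void blelloch_exclusive_scan(const float* g_in, float* g_out, int n)
{
    extern __shared__ float temp[];     // n floats
    int tid = threadIdx.x;
    int offset = 1;

    temp[2*tid]     = g_in[2*tid];      // each thread loads two elements
    temp[2*tid + 1] = g_in[2*tid + 1];

    // Up-sweep (reduce): build partial sums up the tree.
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    if (tid == 0) temp[n - 1] = 0;      // clear the root

    // Down-sweep: pass each node's value to its left child,
    // partial sums to the right.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    g_out[2*tid]     = temp[2*tid];
    g_out[2*tid + 1] = temp[2*tid + 1];
}

// Launch (illustrative): blelloch_exclusive_scan<<<1, n/2, n*sizeof(float)>>>(d_in, d_out, n);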
Tree Analogy

Array contents during the down-sweep (n = 8):

After up-sweep: x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  ∑(x0..x7)
Clear root:     x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  0
d = 2:          x0  ∑(x0..x1)  x2  0          x4  ∑(x4..x5)  x6  ∑(x0..x3)
d = 1:          x0  0          x2  ∑(x0..x1)  x4  ∑(x0..x3)  x6  ∑(x0..x5)
d = 0:          0   x0  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
O(n) Segmented Scan

 Up-Sweep (figure in original slides)
 Down-Sweep (figure in original slides)
Features of Segmented Scan

 About 3 times slower than unsegmented scan
 Useful for building a broad variety of applications which are not possible with unsegmented scan
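
The slides leave the segment representation implicit; a common layout (following the Sengupta et al. paper, an assumption here) pairs the data array with a head-flags array marking the first element of each segment. A sequential sketch of the exclusive segmented scan:

// Exclusive segmented scan with head flags (flags[i] == 1 marks the start
// of a segment); the function name is illustrative.
void segmented_exclusive_scan(const int* in, const int* flags, int* out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        if (flags[i]) running = 0;   // restart the sum at each segment head
        out[i] = running;
        running += in[i];
    }
}
// in    = {3, 1, 7, 0, 4, 1, 6, 3}
// flags = {1, 0, 0, 1, 0, 1, 0, 0}
// out   = {0, 3, 4, 0, 0, 0, 1, 7}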
Primitives Built on Scan

 Enumerate
 enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]
 Exclusive scan of input vector

 Distribute (copy)
 distribute([a b c][d e]) = [a a a][d d]
 Inclusive scan of input vector

 Split and split-and-segment
 Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right
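
A sequential sketch of enumerate, which is just an exclusive +-scan over the 0/1 flags (the function name is mine):

// Enumerate: each element receives the count of true flags to its left.
void enumerate(const int* flags, int* out, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        out[i] = count;          // exclusive: does not include flags[i]
        count += flags[i];
    }
}
// enumerate({1,0,0,1,0,1,1}) = {0,1,1,1,2,2,3}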
Applications

 Quicksort
 Sparse Matrix-Vector Multiply
 Tridiagonal Matrix Solvers and Fluid Simulation
 Radix Sort
 Stream Compaction
 Summed-Area Tables
Quicksort
(figure in original slides)

Sparse Matrix-Vector Multiplication
(figure in original slides)
Stream Compaction

 Definition:
 Extracts the 'interesting' elements from an array of elements and places them contiguously in a new array

 Uses:
 Collision Detection
 Sparse Matrix Compression

 Input:  A B A D D E C F B
 Output: A B A C B
Stream Compaction Example

Input (keep the gray elements):   A B A D D E C F B
Set a '1' for each gray input:    1 1 1 0 0 0 1 0 1
Exclusive scan of the flags:      0 1 2 3 3 3 3 4 4

Scatter the gray inputs to the output, using the scan result as the scatter address:

Output:           A B A C B
Scatter address:  0 1 2 3 4
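
A CUDA sketch of the flag and scatter steps (kernel names and the predicate are illustrative; the address array comes from any exclusive scan, such as the kernel sketched earlier or cudppScan):

__global__ void flag_kernel(const int* in, int* flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (in[i] % 2 == 0);   // example predicate: keep even values
}

__global__ void scatter_kernel(const int* in, const int* flags,
                               const int* addr, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[addr[i]] = in[i];          // addr = exclusive scan of flags
}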
Radix Sort Using Scan

Input array:                 100 111 010 110 011 101 001 000
b = least significant bit:     0   1   0   0   1   1   1   0
e = 1 for false sort keys:     1   0   1   1   0   0   0   1
f = exclusive scan of e:       0   1   1   2   3   3   3   3

Total falses = e[n-1] + f[n-1] = 1 + 3 = 4

t = index - f + total falses:
  0-0+4  1-1+4  2-1+4  3-2+4  4-3+4  5-3+4  6-3+4  7-3+4
   = 4    = 4    = 5    = 5    = 5    = 6    = 7    = 8

d = b ? t : f:                 0   4   1   2   5   6   7   3

Scatter the input using d as the scatter address:
                             100 010 110 000 111 011 101 001
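
A sequential sketch of one pass of this split operation (names are mine); repeating the pass for each bit, least significant first, yields the full radix sort because the scatter is stable:

// One split pass: stable-partition keys by the given bit.
void radix_split(const unsigned* in, unsigned* out, int n, int bit)
{
    int total_falses = 0;
    for (int i = 0; i < n; i++)
        total_falses += !((in[i] >> bit) & 1);

    int f = 0;                                   // exclusive scan of e, on the fly
    for (int i = 0; i < n; i++) {
        int b = (in[i] >> bit) & 1;              // sort key bit
        int d = b ? (i - f + total_falses) : f;  // d = b ? t : f
        out[d] = in[i];
        if (!b) f++;
    }
}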
Specialized Libraries

 CUDPP: CUDA Data Parallel Primitives Library
 CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction.

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle, void* d_y, const void* d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.
CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;
cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);
CUFFT

 For fewer than 8192 elements, slower than FFTW
 Above 8192 elements, roughly 5x speedup over threaded FFTW and 10x over serial FFTW
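
A minimal CUFFT usage sketch for a 1D complex-to-complex forward transform (assumes d_data already holds n cufftComplex values on the device; error checking omitted):

#include <cufft.h>

void fft_forward(cufftComplex* d_data, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // one 1D transform of size n
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT
    cufftDestroy(plan);
}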
CUBLAS

 CUDA Basic Linear Algebra Subprograms (BLAS)
 Saxpy, conjugate gradient, linear solvers
 3D reconstruction of planetary nebulae:
 http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
 GPU variant 100 times faster than the CPU version
 Matrix size is limited by graphics card memory and texture size
 Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS
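
A minimal Saxpy (y = alpha*x + y) sketch against the legacy CUBLAS API described in the CUBLAS 2.0 manual linked below; error checking omitted:

#include <cublas.h>

void saxpy_on_gpu(int n, float alpha, const float* h_x, float* h_y)
{
    float *d_x, *d_y;
    cublasInit();
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // copy host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);              // y = alpha*x + y
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // copy result back
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}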
Useful Links

 http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
 http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
 http://gpgpu.org/developer/cudpp
 http://gpgpu.org/2009/05/31/thrust