CUDA Tricks PDF
Presented by Damodaran Ramani
Synopsis
Scan Algorithm
Applications
Specialized Libraries
These primitive programs spawn a
thread for each primitive to keep the parallel
processors full
Stream programming model (particle
systems, image processing, grid-based
fluid simulations, and dense matrix algebra)
Fragment program operating on n fragments
(accesses: O(n))
A problem arises when access requirements are
complex (e.g., prefix sum: O(n^2))
Prefix-Sum Example
in: 3 1 7 0 4 1 6 3
out: 0 3 4 11 11 15 16 22
Trivial Sequential Implementation
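/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1]. */
void scan(int* in, int* out, int n)
{
    if (n <= 0) return;   /* nothing to scan */
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}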
Scan: An Efficient Parallel Primitive
Complexity: O(n log2 n)
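The slides do not show the O(n log2 n) scan itself. A minimal sketch, assuming it is the usual step-doubling scan (often attributed to Hillis and Steele): on each step every element adds in the value `offset` positions to its left, and `offset` doubles. The name `naive_scan`, the double buffering and the sequential inner loop standing in for the parallel threads are all illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Naive exclusive scan: log2(n) steps, up to n adds per step.
   The inner loop stands in for threads that would run in parallel.
   Returns 0 on success, -1 if a scratch buffer cannot be allocated. */
int naive_scan(const int *in, int *out, int n)
{
    assert(n >= 0);
    if (n == 0) return 0;

    int *a = malloc((size_t)n * sizeof *a);
    int *b = malloc((size_t)n * sizeof *b);
    if (!a || !b) { free(a); free(b); return -1; }

    /* Shift the input right by one so the result is exclusive. */
    a[0] = 0;
    for (int i = 1; i < n; i++) a[i] = in[i - 1];

    for (int offset = 1; offset < n; offset *= 2) {
        for (int i = 0; i < n; i++)              /* "in parallel" */
            b[i] = (i >= offset) ? a[i] + a[i - offset] : a[i];
        int *tmp = a; a = b; b = tmp;            /* swap read/write buffers */
    }

    memcpy(out, a, (size_t)n * sizeof *a);
    free(a);
    free(b);
    return 0;
}
```

Two buffers are used because, in a parallel setting, a thread must not overwrite a value that another thread still needs to read in the same step.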
A work efficient parallel scan
The goal is a parallel scan that is O(n)
instead of O(n log2 n)
Solution:
Balanced trees: build a binary tree on the
input data and sweep it to and from the
root.
A binary tree with n leaves has d = log2 n levels;
each level d has 2^d nodes
One add is performed per node, therefore
O(n) adds on a single traversal of the tree.
O(n) unsegmented scan
Reduce/Up-Sweep
for (d = 0; d <= log2(n) - 1; d++)
    for all k = 0; k < n-1; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
Down-Sweep
x[n-1] = 0;
for (d = log2(n) - 1; d >= 0; d--)
    for all k = 0; k < n-1; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
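The two sweeps above can be sketched in plain C. This is a sequential illustration, not a GPU kernel: the inner `k` loops stand in for the "for all k ... in parallel" steps. It assumes n is a power of two. The name `blelloch_scan` and the variables `half` (for 2^d) and `stride` (for 2^(d+1)) are illustrative. Here the up-sweep runs over all log2 n levels.

```c
#include <assert.h>

/* In-place exclusive scan of x[0..n-1] using an up-sweep and a down-sweep.
   Assumes n is a power of two. */
void blelloch_scan(int *x, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep (reduce): half = 2^d, stride = 2^(d+1). */
    for (int half = 1; half < n; half *= 2) {
        int stride = 2 * half;
        for (int k = 0; k < n; k += stride)      /* "in parallel" */
            x[k + stride - 1] += x[k + half - 1];
    }

    /* Down-sweep: clear the root, then push partial sums back down. */
    x[n - 1] = 0;
    for (int half = n / 2; half >= 1; half /= 2) {
        int stride = 2 * half;
        for (int k = 0; k < n; k += stride) {    /* "in parallel" */
            int t = x[k + half - 1];
            x[k + half - 1] = x[k + stride - 1];
            x[k + stride - 1] += t;
        }
    }
}
```

Each sweep performs n - 1 adds in total, which is where the O(n) work bound comes from.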
Tree analogy (figures): up-sweep and down-sweep
Features of segmented scan
3 times slower than unsegmented scan
Useful for building a broad variety of
applications that are not possible with
unsegmented scan.
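The slides do not define segmented scan in code. As a rough sequential reference, assuming segments are marked by head flags (a nonzero flag at the first element of each segment), the running sum simply restarts at every flag. The name `segmented_scan` and the flag encoding are assumptions made for illustration.

```c
#include <assert.h>

/* Sequential reference for an exclusive segmented sum scan.
   flags[i] != 0 marks the first element of a segment (assumed encoding);
   element 0 always starts a segment. */
void segmented_scan(const int *in, const int *flags, int *out, int n)
{
    assert(n >= 0);
    int running = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0 || flags[i]) running = 0;   /* restart at a segment head */
        out[i] = running;
        running += in[i];
    }
}
```

For example, with input 3 1 7 0 4 1 6 3 split into segments [3 1 7][0 4 1][6 3], each segment is scanned independently of its neighbours.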
Primitives built on scan
Enumerate
enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]
Exclusive scan of input vector
Distribute (copy)
distribute([a b c][d e]) = [a a a][d d]
Inclusive scan of input vector
Split and split-and-segment
Split divides the input vector into two pieces, with all the
elements marked false on the left side of the output vector and all the
elements marked true on the right.
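The three primitives above can be sketched as sequential C reference functions. On a GPU the loops would be replaced by scans; here they only show what each primitive computes. The names `scan_enumerate`, `scan_distribute` and `scan_split`, the use of `char` elements, and the head-flag encoding of segments are illustrative assumptions.

```c
#include <assert.h>

/* enumerate: out[i] = number of true flags strictly before position i
   (an exclusive scan of the flags). Returns the total number of trues. */
int scan_enumerate(const int *flags, int *out, int n)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        out[i] = count;
        count += flags[i] ? 1 : 0;
    }
    return count;
}

/* distribute: copy the first element of each segment across that segment.
   heads[i] != 0 marks the start of a segment (assumed encoding). */
void scan_distribute(const char *in, const int *heads, char *out, int n)
{
    assert(n >= 0);
    char current = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0 || heads[i]) current = in[i];
        out[i] = current;
    }
}

/* split: stable partition with false elements first, then true elements. */
void scan_split(const char *in, const int *flags, char *out, int n)
{
    assert(n >= 0);
    int total_true = 0;
    for (int i = 0; i < n; i++) total_true += flags[i] ? 1 : 0;
    int total_false = n - total_true;

    int true_before = 0;                 /* running enumerate of the flags */
    for (int i = 0; i < n; i++) {
        /* i - true_before is the number of false elements before i. */
        int dest = flags[i] ? total_false + true_before : i - true_before;
        out[dest] = in[i];
        true_before += flags[i] ? 1 : 0;
    }
}
```

With flags t f f t f t t, `scan_enumerate` reproduces the slide's result 0 1 1 1 2 2 3, and `scan_split` uses those same counts as destination addresses.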
Applications
Quicksort
Sparse Matrix-Vector Multiply
Tridiagonal Matrix Solvers and Fluid
Simulation
Radix Sort
Stream Compaction
Summed-Area Tables
Quicksort
Sparse Matrix-Vector Multiplication
Stream Compaction
Definition:
Extracts the elements of interest from an array of
elements and places them contiguously in a new
array
Uses:
Collision Detection
Sparse Matrix Compression
Input:  A B A D D E C F B
Output: A B A C B
Stream Compaction
A B A D D E C F B    Input: we want to preserve the gray elements (flagged 1 in the next row)
1 1 1 0 0 0 1 0 1    Set a '1' in each gray input
0 1 2 3 3 3 3 4 4    Scan
A B A C B            Scatter gray inputs to output, using the scan result as the scatter address
0 1 2 3 4            Output positions
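The flag, scan and scatter steps can be sketched in C as follows. This is a sequential illustration under stated assumptions: the name `compact`, the `char` element type and the caller-supplied `addr` scratch array are hypothetical, and both loops would run in parallel on a GPU.

```c
#include <assert.h>

/* Stream compaction: copy in[i] to out for every i with keep[i] != 0,
   preserving order. addr receives the exclusive scan of the flags, which
   is the scatter address of each kept element. out and addr must each
   have room for n elements. Returns the number of elements kept. */
int compact(const char *in, const int *keep, char *out, int *addr, int n)
{
    assert(n >= 0);

    /* Scan: addr[i] = number of kept elements before position i. */
    int count = 0;
    for (int i = 0; i < n; i++) {
        addr[i] = count;
        count += keep[i] ? 1 : 0;
    }

    /* Scatter: every kept element writes to its own address. */
    for (int i = 0; i < n; i++)
        if (keep[i]) out[addr[i]] = in[i];

    return count;
}
```

Because each kept element gets a distinct address from the scan, the scatter has no write conflicts, which is what makes it safe to run in parallel.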
Radix Sort Using Scan
100 111 010 110 011 101 001 000    Input array
0 1 0 0 1 1 1 0                    b = least significant bit
1 0 1 1 0 0 0 1                    e = insert a 1 for all false sort keys
0 1 1 2 3 3 3 3                    f = scan the 1s
0 4 1 2 5 6 7 3                    d = b ? t : f
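One pass of the radix sort can be sketched in C using the slide's names b, e, f, t and d. The slide uses t without defining it; the sketch assumes the usual formulation t = i - f + totalFalses, where totalFalses is the number of keys whose current bit is 0. The function name `radix_split` is illustrative, and the loops are sequential stand-ins for the parallel steps.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* One radix-sort pass: stable split of in[] on the given bit.
   Keys with the bit clear go first, keys with the bit set follow.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_split(const unsigned *in, unsigned *out, int n, int bit)
{
    assert(n >= 0 && bit >= 0 && bit < (int)(sizeof(unsigned) * CHAR_BIT));
    if (n == 0) return 0;

    int *f = malloc((size_t)n * sizeof *f);
    if (!f) return -1;

    /* e = 1 for false keys; f = exclusive scan of e. */
    int total_falses = 0;
    for (int i = 0; i < n; i++) {
        int e = !((in[i] >> bit) & 1u);
        f[i] = total_falses;
        total_falses += e;
    }

    /* d = b ? t : f, with t = i - f + totalFalses (assumed definition). */
    for (int i = 0; i < n; i++) {
        int b = (int)((in[i] >> bit) & 1u);
        int t = i - f[i] + total_falses;
        int d = b ? t : f[i];
        out[d] = in[i];
    }

    free(f);
    return 0;
}
```

Calling `radix_split` once per bit, from least significant to most significant and swapping the input and output buffers between passes, sorts the keys because each pass is stable.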