Data-Parallel Algorithms on GPUs
Mark Harris
NVIDIA Developer Technology
Outline
• Introduction
• Algorithmic complexity on GPUs
• Algorithmic Building Blocks
– Gather & Scatter
– Reductions
– Scan (parallel prefix)
– Sort
– Search
Data-Parallel Algorithms
• The GPU is a data-parallel processor
– Data-parallel kernels of applications can be
accelerated on the GPU
• Efficient algorithms require efficient
building blocks
• This talk: data-parallel building blocks
– Gather & Scatter
– Map
– Reduce and Scan
– Sort and Search
Algorithmic Complexity on GPUs
• We will use standard “Big O” notation
– e.g., optimal sequential sort is O(n log n)
• GPGPU element of parallelism is the pixel
– Each pixel generates one output element
– O(n) typically means n pixels processed
• In general, GPGPU O(n) usually means O(n/p)
processing time
– p is the number of “pixel processors” on the GPU
• NVIDIA G70 has 24 pixel shader pipelines
• NVIDIA G80 has 128 unified shader processors
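As a rough illustration of the O(n/p) point above, the following sketch counts how many rounds are needed to touch n elements when p of them are processed at once. It is an idealized model only: it assumes every processor handles exactly one element per round and ignores memory bandwidth, scheduling, and per-pass overhead. The function name `parallel_rounds` is hypothetical.

```python
def parallel_rounds(n: int, p: int) -> int:
    """Idealized number of rounds needed to process n elements, p at a time."""
    if n < 0 or p <= 0:
        raise ValueError("n must be non-negative and p must be positive")
    # Integer ceiling division: the last round may be only partly full.
    return (n + p - 1) // p


n = 1_000_000
rounds_g70 = parallel_rounds(n, 24)    # 24 pixel shader pipelines (from the slide)
rounds_g80 = parallel_rounds(n, 128)   # 128 unified shader processors (from the slide)
print(rounds_g70, rounds_g80)
```

The asymptotic class does not change with p; only the constant factor does, which is why the slides keep using Big O notation in terms of n.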
Step vs. Work Complexity
• Important to distinguish between the two
[Figure: 1D parallel reduction tree; each pass of + operations halves the data: N, N/2, N/4, …, 1]
O(log₂ N) steps, O(N) work
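The difference between steps and work can be made concrete with a small CPU-side sketch of a tree-style reduction. It is not GPU code: each trip through the `while` loop stands in for one parallel pass, and the inner loop stands in for the outputs that would be computed simultaneously. The name `parallel_reduce` and the choice of pairing element i with element i + half are assumptions made for illustration.

```python
def parallel_reduce(values):
    """Sum a non-empty sequence tree-style; returns (total, steps, adds)."""
    data = list(values)
    if not data:
        raise ValueError("parallel_reduce needs at least one element")
    steps = 0
    adds = 0
    while len(data) > 1:
        half = (len(data) + 1) // 2
        nxt = []
        # One "pass": output i combines element i with element i + half.
        for i in range(half):
            if i + half < len(data):
                nxt.append(data[i] + data[i + half])
                adds += 1
            else:
                nxt.append(data[i])  # odd length: the middle element is carried over
        data = nxt
        steps += 1
    return data[0], steps, adds


total, steps, adds = parallel_reduce([3, 1, 7, 0, 4, 1, 6, 3])
print(total, steps, adds)
```

For 8 inputs the loop runs 3 times (the step count) while performing 7 additions in total (the work), matching the logarithmic-steps, linear-work pattern on the slide.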
Multiple 1D Parallel Reductions
• Can run many reductions in parallel
– Use 2D texture and reduce one dimension
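[Figure: M parallel 1D reductions on a 2D texture; each pass of + operations halves one dimension: MxN, MxN/2, MxN/4, …, Mx1]
O(log₂ N) steps, O(MN) work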
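A minimal sketch of the same idea for many rows at once: an M x N grid (standing in for the 2D texture) is reduced along one dimension only, so M independent sums come out after about log₂ N passes. This is a sequential CPU model, and `reduce_rows` is a hypothetical helper that assumes a non-empty grid whose rows all have the same non-zero length.

```python
def reduce_rows(grid):
    """Sum every row of an M x N grid by halving the N dimension each pass.

    Returns (row_sums, steps). Assumes equal-length, non-empty rows.
    """
    rows = [list(r) for r in grid]
    steps = 0
    while len(rows[0]) > 1:
        width = len(rows[0])
        half = (width + 1) // 2
        # One pass over the whole grid: every row is halved at the same time.
        rows = [
            [r[i] + r[i + half] if i + half < width else r[i] for i in range(half)]
            for r in rows
        ]
        steps += 1
    return [r[0] for r in rows], steps


sums, steps = reduce_rows([[1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 1]])
print(sums, steps)
```

The number of passes depends only on N, not on M, which is why running many reductions together costs no extra steps.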
2D reductions
• Like 1D reduction, only reduce in both
directions simultaneously
• Example:
scan(+, [3 1 7 0 4 1 6 3]) =
[3 4 11 11 15 16 22 25]
(From Blelloch, 1990, “Prefix Sums and Their Applications”)
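For reference, the scan in the example above is an inclusive one: output k is the combination of inputs 0 through k. A sequential sketch, useful as a baseline for checking any parallel version, might look like the following. The function name `inclusive_scan` is an assumption, and the operator is passed in so the same code covers operations other than +.

```python
import operator


def inclusive_scan(op, values):
    """Sequential inclusive scan: out[k] = values[0] op values[1] op ... op values[k]."""
    out = []
    for v in values:
        # The first element is copied; every later one combines with the running result.
        out.append(v if not out else op(out[-1], v))
    return out


running_sums = inclusive_scan(operator.add, [3, 1, 7, 0, 4, 1, 6, 3])
running_max = inclusive_scan(max, [3, 1, 7, 0, 4, 1, 6, 3])
print(running_sums)
print(running_max)
```

This version performs one application of the operator per element after the first, which is the O(n) sequential cost that parallel scans are measured against.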
Applications of Scan
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms
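To show how one of the listed applications uses scan, here is a hedged sketch of stream compaction: a 0/1 keep flag is computed per element, an exclusive scan of the flags gives each kept element its output address, and a scatter writes the survivors contiguously. The code is sequential Python, not the talk's implementation; `exclusive_scan` and `compact` are hypothetical names.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[k] is the sum of values[0..k-1]."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(items, keep):
    """Return the items for which keep(x) is true, preserving their order."""
    if not items:
        return []
    flags = [1 if keep(x) else 0 for x in items]
    addresses = exclusive_scan(flags)
    # Last address plus last flag gives the number of surviving elements.
    out = [None] * (addresses[-1] + flags[-1])
    for x, flag, addr in zip(items, flags, addresses):
        if flag:
            out[addr] = x  # scatter: each survivor already knows its slot
    return out


print(compact([3, 0, 7, 0, 4, 0], lambda x: x != 0))
```

Because every write address is known before the scatter, the writes are independent of one another, which is what makes the pattern suitable for data-parallel hardware.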
A Naive Parallel Scan Algorithm
• log(n) iterations
• Due to ping-pong, render a 2nd quad from 2^(i-1) to 2^i
with a simple pass-through shader: vout = vin
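The ping-pong structure described above can be sketched on the CPU with two buffers: each iteration reads only from the source buffer and writes only to the destination buffer, then the two swap roles. This is an assumption-laden model rather than the shader code from the talk. In particular, it copies every element that is not updated in a pass, which is simpler than the narrower pass-through quad mentioned on the slide, and the name `naive_scan` is hypothetical.

```python
def naive_scan(values):
    """Naive inclusive prefix sum using double buffering (about log2(n) passes)."""
    src = list(values)
    n = len(src)
    offset = 1
    while offset < n:
        dst = [0] * n
        for k in range(n):  # on a GPU, every k would be computed in parallel
            if k >= offset:
                dst[k] = src[k] + src[k - offset]
            else:
                dst[k] = src[k]  # pass-through: vout = vin
        src = dst  # swap buffers (ping-pong)
        offset *= 2
    return src


print(naive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

Reading from one buffer and writing to the other is what keeps a pass from consuming values that the same pass has already overwritten.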
A Naive Parallel Scan Algorithm
• Algorithm given in more detail in [Horn ‘05]
• Step-efficient, but not work-efficient
– O(log n) steps, but O(n log n) adds
– Sequential version is O(n)
– A factor of log(n) hurts: 20x for 10^6 elements!
• Note: tricky to implement using a graphics API
– Due to interleaving of
new and old results
– Can reformulate layout
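The step-versus-work comparison for the naive scan (O(log n) steps but O(n log n) adds, against O(n) sequentially) can be checked by counting additions directly. The sketch below assumes the doubling-offset formulation, in which the pass with a given offset performs n minus offset additions, and it counts only additions, not pass-through copies. `naive_scan_adds` is a hypothetical helper.

```python
def naive_scan_adds(n: int) -> int:
    """Count additions in a naive doubling-offset scan over n elements."""
    adds = 0
    offset = 1
    while offset < n:
        adds += n - offset  # elements below the offset are only copied
        offset *= 2
    return adds


n = 2 ** 20                      # roughly 10^6 elements
naive = naive_scan_adds(n)
sequential = n - 1               # one add per element after the first
print(naive, sequential, naive / sequential)
```

Under these assumptions the ratio for about a million elements comes out close to log₂ n, consistent with the order-of-magnitude penalty quoted on the slide.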