Brook For GPUs - Stream Computing On Graphics Hardware - Slides (2004)
SIGGRAPH 2004 2
recent trends
GPU-based SIGGRAPH/Graphics Hardware papers
13
domain specific solutions
requires extensive knowledge of GPU programming
building an abstraction
contributions
• Brook stream programming environment
for GPU-based computing
– language, compiler, and runtime system
GPU programming model
each fragment shaded independently
– no dependencies between fragments
• temporary registers are zeroed
• no static variables
• no read-modify-write textures
– multiple “pixel pipes”
GPU = data parallel
data parallelism
– support ALU heavy architectures
– hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
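The independence property can be illustrated with a small CPU-side sketch. This is Python standing in for the hardware, not Brook or shader code; `shade` and `run_pipes` are hypothetical names, and the shading function is an arbitrary placeholder. Because no fragment reads another fragment's result, the number of "pixel pipes" does not change the output.

```python
def shade(fragment):
    """Hypothetical per-fragment program: depends only on its own input."""
    x, y = fragment
    return (x * 0.5 + y * 0.25) % 1.0


def run_pipes(fragments, num_pipes):
    """Distribute fragments over `num_pipes` workers and reassemble in order."""
    results = [None] * len(fragments)
    for pipe in range(num_pipes):
        # Each pipe takes an interleaved subset; it never reads another
        # pipe's output, so the pipes could run concurrently.
        for i in range(pipe, len(fragments), num_pipes):
            results[i] = shade(fragments[i])
    return results


# A 4x4 block of fragment positions (illustrative data only).
fragments = [(x, y) for y in range(4) for x in range(4)]
```

Running `run_pipes(fragments, 1)` and `run_pipes(fragments, 4)` gives the same list, which is the property that lets hardware add pipes freely.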
compute vs. bandwidth
[chart: compute (GFLOPS) vs. bandwidth (GFloats/sec), 7x gap]
graphics pipeline
– vertex
• BW: 1 vertex = 32 bytes
• OP: 100-500 f32-ops / vertex
– fragment
• BW: 1 fragment = 10 bytes
• OP: 300-1000 i8-ops/fragment
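The figures above imply a ratio of operations to bytes moved at each pipeline stage. A minimal Python sketch of that arithmetic follows; `ops_per_byte` is a hypothetical helper, and the two stages count different kinds of operations (f32 vs. i8), so the ratios are only roughly comparable.

```python
def ops_per_byte(ops, bytes_per_element):
    """Operations performed per byte of bandwidth consumed."""
    return ops / bytes_per_element


# Vertex stage: 100-500 f32-ops on a 32-byte vertex.
vertex_range = (ops_per_byte(100, 32), ops_per_byte(500, 32))

# Fragment stage: 300-1000 i8-ops on a 10-byte fragment.
fragment_range = (ops_per_byte(300, 10), ops_per_byte(1000, 10))
```

Under these numbers the vertex stage performs about 3 to 16 ops per byte and the fragment stage 30 to 100.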
Brook language
stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels
design goals
• general purpose computing
GPU = general streaming-coprocessor
• GPU-based computing for the masses
no graphics experience required
eliminating annoying GPU limitations
• performance
• platform independent
ATI & NVIDIA
DirectX & OpenGL
Windows & Linux
Other languages
• Cg / HLSL / OpenGL Shading Language
+ C-like language for expressing shader computation
– graphics execution model
– requires graphics API for data management and shader execution
• Sh [McCool et al. '04]
+ functional approach for specifying shaders
• evolved from a shading language
• Connection Machine C*
• StreamIt, StreamC & KernelC, Ptolemy
Brook language
C with streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …
Ray r<200>;
float3 velocityfield<100,100,100>;
– data parallelism
• provides data to operate on in parallel
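To make the two declarations concrete, the sketch below models a stream's shape as a plain tuple in Python. This is an illustration only, not how Brook stores streams; `element_count` is a hypothetical helper.

```python
from math import prod


def element_count(shape):
    """Number of records in a stream with the given dimensions."""
    return prod(shape)


# Shapes taken from the declarations above.
rays_shape = (200,)                  # Ray r<200>;
velocity_shape = (100, 100, 100)     # float3 velocityfield<100,100,100>;
```

`element_count(velocity_shape)` is one million records, each of which can be processed in parallel.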
kernels
• kernels
– functions applied to streams
• similar to for_all construct
• no dependencies between stream elements
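A kernel behaves like a for_all over stream elements. The Python sketch below emulates that on the CPU; `apply_kernel` and `saxpy_element` are hypothetical names, and SAXPY is used only because it is one of the applications listed later in the talk.

```python
def apply_kernel(kernel, *input_streams):
    """Apply `kernel` elementwise; output i depends only on inputs i."""
    lengths = {len(s) for s in input_streams}
    if len(lengths) != 1:
        raise ValueError("all input streams must have the same length")
    # No element reads another element's result, so the iterations
    # could execute in any order or all at once.
    return [kernel(*elems) for elems in zip(*input_streams)]


def saxpy_element(x, y, alpha=2.0):
    """Per-element body of SAXPY: alpha * x + y."""
    return alpha * x + y
```

For example, `apply_kernel(saxpy_element, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])` yields `[12.0, 24.0, 36.0]`.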
kernels
• kernel arguments
– input/output streams
– gather streams
– iterator streams
– constant parameters
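The four argument kinds can be told apart by how a kernel reads them. The Python sketch below is a rough emulation under stated assumptions: `run_kernel` and `lookup_and_scale` are hypothetical names, and the iterator stream is simplified to the element's position as a float.

```python
def run_kernel(kernel, inputs, gather, constant):
    """Call `kernel` once per input element and collect the output stream."""
    output = []
    for position, element in enumerate(inputs):
        iterator_value = float(position)  # assumed iterator: 0.0, 1.0, 2.0, ...
        output.append(kernel(element, gather, iterator_value, constant))
    return output


def lookup_and_scale(element, table, it, scale):
    # element: one value from the input stream (used here as an index)
    # table:   gather stream, read at an arbitrary index
    # it:      iterator stream value for this element
    # scale:   constant parameter, identical for every element
    return table[element] * scale + it
```

With inputs `[2, 0, 1]`, gather table `[10.0, 20.0, 30.0]`, and constant `0.5`, the output stream is `[15.0, 6.0, 12.0]`.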
kernels
why not allow direct array operators? A+B*C
– arithmetic intensity
• local to computation
– explicit communication
• kernel arguments
Ray-triangle intersection
kernel void
krnIntersectTriangle(Ray ray<>, Triangle tris[],
                     RayState oldraystate<>,
                     GridTrilist trilist[],
                     out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if(oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    // the next two assignments are missing from the extracted slide text;
    // they are assumed here from the standard ray-triangle test
    pvec = cross( ray.d, edge2 );
    det = dot( edge1, pvec );
    inv_det = 1.0f/det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot( tvec, pvec );
    qvec = cross( tvec, edge1 );
    candidatehit.data.z = dot( ray.d, qvec );
    candidatehit.data.x = dot( edge2, qvec );
    candidatehit.data.xyz *= inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0,0,0,-1);
  }
}
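For checking the arithmetic on a CPU, the kernel's active branch can be mirrored in Python. This is a sketch, not the Brook runtime: the function names are hypothetical, vectors are plain tuples, and `pvec` and `det` are computed as in the standard Möller-Trumbore test, which is an assumption about the steps the kernel uses. Like the kernel, it does not guard against `det == 0`.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_triangle(ray_o, ray_d, v0, v1, v2, idx):
    """Return (x, y, z, w) laid out like candidatehit.data in the kernel."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(ray_d, edge2)
    det = dot(edge1, pvec)
    inv_det = 1.0 / det
    tvec = sub(ray_o, v0)
    y = dot(tvec, pvec)      # candidatehit.data.y
    qvec = cross(tvec, edge1)
    z = dot(ray_d, qvec)     # candidatehit.data.z
    x = dot(edge2, qvec)     # candidatehit.data.x
    return (x * inv_det, y * inv_det, z * inv_det, idx)
```

A ray from `(0.25, 0.25, -1.0)` along `+z` against the unit triangle in the `z = 0` plane returns distance 1.0 in `x` and barycentric coordinates 0.25, 0.25 in `y` and `z`.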
reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a,r);
is equivalent to
r = a[0];
for (int i=1; i<100; i++)
  r += a[i];
reductions
• reductions
– associative operations only
(a+b)+c = a+(b+c)
• sum, multiply, max, min, OR, AND, XOR
• matrix multiply
– permits parallel execution
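Associativity is what allows the regrouping. The Python sketch below shows one possible scheme, a pairwise tree reduction; `tree_reduce` is a hypothetical name and the slides do not say which scheme the Brook runtime uses. Operand order is preserved, so commutativity is not required, which is why matrix multiply qualifies.

```python
def tree_reduce(op, values):
    """Reduce `values` by combining adjacent pairs level by level."""
    if not values:
        raise ValueError("cannot reduce an empty stream")
    level = list(values)
    while len(level) > 1:
        # All pair combinations within one level are independent,
        # so they could execute in parallel.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
    return level[0]
```

For an associative operator the result matches the sequential loop. For a non-associative one such as subtraction it does not: `[1, 2, 3, 4]` gives 0 here but -8 sequentially.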
system outline
brcc
source to source compiler
– generate Cg & HLSL code
– cgc and fxc for shader assembly
– virtualization
brt
Brook run-time library
– stream texture management
– kernel shader execution
eliminating GPU limitations
treating texture as memory
– limited texture size and dimension
– compiler inserts address translation code
float matrix<8096,10,30,5>;
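One plausible form of address translation is sketched below in Python: linearize the N-dimensional index in row-major order, then split it into 2D texture coordinates. The actual code emitted by the compiler is not shown in the slides; the function names and the texture width of 4096 are assumptions for illustration.

```python
def flatten_index(index, shape):
    """Row-major linear offset of `index` within a stream of `shape`."""
    linear = 0
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        linear = linear * extent + i
    return linear


def texture_coords(index, shape, tex_width):
    """Map an N-dimensional stream index to (x, y) in a 2D texture."""
    linear = flatten_index(index, shape)
    return (linear % tex_width, linear // tex_width)


matrix_shape = (8096, 10, 30, 5)   # from the declaration above
TEX_WIDTH = 4096                   # hypothetical hardware limit
```

Under these assumptions, element `(3, 0, 0, 0)` has linear offset 4500 and lands at texel `(404, 1)`.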
eliminating GPU limitations
extending kernel outputs
– duplicate kernels, let cgc or fxc do dead code elimination
– better solution:
"Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware"
Tim Foley, Mike Houston, and Pat Hanrahan
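The kernel-duplication idea can be sketched in Python: a kernel with several outputs becomes one single-output pass per output. The names are hypothetical. In this sketch every pass recomputes the whole kernel body; in the scheme described above, the shader compiler's dead code elimination would strip the work that a given pass does not need.

```python
def split_outputs(kernel, num_outputs):
    """Return one single-output function per output of `kernel`."""
    def make_pass(k):
        # Each pass evaluates the full kernel and keeps only output k.
        return lambda *args: kernel(*args)[k]
    return [make_pass(k) for k in range(num_outputs)]


def run_multi_output(kernel, num_outputs, stream):
    """Run one pass per output and return a list of output streams."""
    passes = split_outputs(kernel, num_outputs)
    return [[single_pass(x) for x in stream] for single_pass in passes]


def two_outputs(x):
    """Illustrative kernel body with two outputs."""
    return (x + 1, x * 2)
```

`run_multi_output(two_outputs, 2, [1, 2, 3])` produces `[[2, 3, 4], [2, 4, 6]]`, one stream per pass.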
applications
ray-tracer
segmentation
SAXPY
SGEMV
evaluation
GPU wins when…
• limited data reuse
– SAXPY, FFT
– Pentium 4 3.0 GHz: 44 GB/sec peak cache bandwidth
– NVIDIA GeForce 6800 Ultra: 36 GB/sec peak memory bandwidth
• arithmetic intensity
– Segment: 3.7 ops per word
– SGEMV: 1/3 ops per word
[charts: Relative Performance for SAXPY, FFT, Segment, and SGEMV]
outperforming the CPU
considering GPU transfer costs: $T_r$
– computational intensity: γ
$\gamma \equiv K_{gpu} / T_r$
work per word transferred
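The definition can be turned into a break-even check under a simple cost model. The Python sketch below assumes the GPU pays transfer time plus kernel time per element while the CPU pays only its own kernel time. The symbols `k_cpu` and `speedup` (defined as `k_cpu / k_gpu`) are not on the slide and are introduced here for illustration. Under that model the GPU wins when γ exceeds `1 / (speedup - 1)`.

```python
def computational_intensity(k_gpu, t_r):
    """gamma = K_gpu / T_r: kernel time relative to transfer time."""
    return k_gpu / t_r


def gpu_outperforms_cpu(k_cpu, k_gpu, t_r):
    """Assumed cost model: GPU pays T_r + K_gpu, CPU pays K_cpu."""
    return t_r + k_gpu < k_cpu


def break_even_intensity(speedup):
    """Smallest gamma at which the GPU wins, for speedup = K_cpu / K_gpu > 1."""
    if speedup <= 1.0:
        raise ValueError("GPU kernel must be faster than the CPU kernel")
    return 1.0 / (speedup - 1.0)
```

With `k_cpu = 10.0`, `k_gpu = 2.0`, and `t_r = 4.0` (arbitrary units), γ is 0.5 and the break-even value is 0.25, so the GPU comes out ahead.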
efficiency
[chart: ATI vs. Pentium 4]
summary
GPU-based computing for the masses
bioinformatics
rendering
statistics
simulation
acknowledgements
• paper
– Bill Mark (UT-Austin)
– Nick Triantos, Tim Purcell (NVIDIA)
– Mark Segal (ATI)
– Kurt Akeley
– Reviewers
• language
– Stanford Merrimac Group
– Reservoir Labs
• sponsors
– DARPA contract MDA904-98-R-S855, F29601-00-2-0085
– DOE ASC contract LLL-B341491
– NVIDIA, ATI, IBM, Sony
– Rambus Stanford Graduate Fellowship
– Stanford School of Engineering Fellowship
Brook for GPUs
• release v0.3 available on Sourceforge
• project page
– http://graphics.stanford.edu/projects/brook
• source
– http://www.sourceforge.net/projects/brook
• over 6K downloads!
• interested in collaborating?