Brook For GPUs - Stream Computing On Graphics Hardware - Slides (2004)

The document discusses Brook, a stream programming environment designed for GPU-based computing, which aims to simplify GPU programming and enhance performance. It highlights the advantages of GPUs over CPUs, particularly in terms of data parallelism and arithmetic intensity, and presents the architecture and features of the Brook language. The paper also includes evaluations of GPU performance in various applications, demonstrating its efficiency compared to traditional CPU implementations.

Brook for GPUs:
Stream Computing on Graphics Hardware

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan

Computer Science Department
Stanford University

recent trends
[Chart: multiplies per second (GFLOPS), July 01 to Jan 04, for NVIDIA NV30, 35, 40; ATI R300, 360, 420; and Pentium 4]

SIGGRAPH 2004 2
recent trends
[Chart: GPU-based SIGGRAPH/Graphics Hardware papers, July 01 to Jan 04; labeled value: 13]

domain specific solutions

map directly to graphics primitives

requires extensive knowledge of GPU programming

building an abstraction

general GPU computing question

– can we simplify GPU programming?

– what is the correct abstraction for GPU-based computing?

– what is the scope of problems that can be implemented efficiently on the GPU?

contributions
• Brook stream programming environment
for GPU-based computing
– language, compiler, and runtime system

• virtualizing or extending GPU resources

• analysis of when GPUs outperform CPUs

GPU programming model
each fragment shaded independently
– no dependencies between fragments
• temporary registers are zeroed
• no static variables
• no read-modify-write textures
– multiple “pixel pipes”

GPU = data parallel
each fragment shaded independently
– no dependencies between fragments
• temporary registers are zeroed
• no static variables
• no read-modify-write textures
– multiple “pixel pipes”
data parallelism
– support ALU heavy architectures
– hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

compute vs. bandwidth

[Chart: compute (GFLOPS) vs. bandwidth (GFloats/sec) on ATI hardware (R300, R360, R420), with a 7x gap]

compute vs. bandwidth
arithmetic intensity =
compute-to-bandwidth ratio

graphics pipeline
– vertex
• BW: 1 vertex = 32 bytes;
• OP: 100-500 f32-ops / vertex
– fragment
• BW: 1 fragment = 10 bytes
• OP: 300-1000 i8-ops/fragment
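As a rough worked example of the ratio defined above, the following C sketch divides the slide's operation counts by its per-element sizes. The function names are hypothetical, and the f32 and i8 operation counts are used exactly as the slide gives them, without normalizing one to the other.

```c
#include <assert.h>

/* Arithmetic intensity as defined on the slide: operations performed per
 * byte of bandwidth consumed. The name ops_per_byte is illustrative only. */
double ops_per_byte(double ops, double bytes) {
    assert(bytes > 0.0);
    return ops / bytes;
}

/* Slide figures: a vertex is 32 bytes and costs 100-500 f32 ops;
 * a fragment is 10 bytes and costs 300-1000 i8 ops. */
double vertex_intensity_low(void)    { return ops_per_byte(100.0, 32.0); }
double vertex_intensity_high(void)   { return ops_per_byte(500.0, 32.0); }
double fragment_intensity_low(void)  { return ops_per_byte(300.0, 10.0); }
double fragment_intensity_high(void) { return ops_per_byte(1000.0, 10.0); }
```

With the slide's numbers this works out to roughly 3 to 16 operations per byte for vertices and 30 to 100 for fragments.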
Brook language
stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels

design goals
• general purpose computing
GPU = general streaming-coprocessor
• GPU-based computing for the masses
no graphics experience required
eliminating annoying GPU limitations
• performance
• platform independent
ATI & NVIDIA
DirectX & OpenGL
Windows & Linux

Other languages
• Cg / HLSL / OpenGL Shading Language
+ C-like language for expressing shader computation
– graphics execution model
– requires graphics API for data management and shader
execution
• Sh [McCool et al. '04]
+ functional approach for specifying shaders
• evolved from a shading language
• Connection Machine C*
• StreamIt, StreamC & KernelC, Ptolemy

Brook language
C with streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …

Ray r<200>;
float3 velocityfield<100,100,100>;

– data parallelism
• provides data to operate on in parallel

kernels
• kernels
– functions applied to streams
• similar to for_all construct
• no dependencies between stream elements

kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);

equivalent C loop:

for (i=0; i<100; i++)
  c[i] = a[i]+b[i];

kernels
• kernel arguments
– input/output streams

kernel void foo (float a<>,
                 float b<>,
                 out float result<>) {
  result = a + b;
}

kernels
• kernel arguments
– input/output streams
– gather streams

kernel void foo (..., float array[] ) {
  a = array[i];
}
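A CPU-side sketch in C of what a gather argument permits: each output element reads the gather array at an index of its choosing, while still writing only its own output. The function name and the explicit index stream `idx` are assumptions for illustration; the slide elides how `i` is produced.

```c
#include <assert.h>
#include <stddef.h>

/* CPU analogue of a kernel with a gather argument. Every output element is
 * computed independently: it reads an index from the input stream idx and
 * fetches array[idx[e]]. Reads are arbitrary, writes are not. */
void gather_kernel(const size_t *idx, size_t n,
                   const float *array, size_t array_len,
                   float *out) {
    for (size_t e = 0; e < n; e++) {
        assert(idx[e] < array_len);   /* gather index must be in range */
        out[e] = array[idx[e]];
    }
}
```

Because no iteration depends on another, the loop body could run for all elements at once, which is the property the stream model relies on.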

kernels
• kernel arguments
– input/output streams
– gather streams
– iterator streams

kernel void foo (..., iter float n<> ) {
  a = n + b;
}
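A minimal C sketch of the idea behind an iterator argument: the value seen by each element is generated from its position rather than read from memory. The spacing rule used here (evenly spaced values from an inclusive start toward an exclusive end) and both function names are assumptions of this sketch, not something the slide specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Value an iterator stream of length len would supply to element e,
 * assuming evenly spaced values from start (inclusive) toward end
 * (exclusive). This spacing rule is an assumption of the sketch. */
float iter_value(float start, float end, size_t len, size_t e) {
    assert(len > 0 && e < len);
    return start + (end - start) * (float)e / (float)len;
}

/* CPU analogue of "a = n + b" with n coming from the iterator:
 * no memory is read for n, it is generated per element. */
void add_iter(float start, float end, const float *b, float *a, size_t len) {
    for (size_t e = 0; e < len; e++)
        a[e] = iter_value(start, end, len, e) + b[e];
}
```

Under these assumptions, an iterator from 0 to 4 over four elements supplies 0, 1, 2, 3.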

kernels
• kernel arguments
– input/output streams
– gather streams
– iterator streams
– constant parameters

kernel void foo (..., float c ) {
  a = c + b;
}

kernels
why not allow direct array operators?

A+B*C

– arithmetic intensity
• temporaries kept local to computation
– explicit communication
• kernel arguments

Ray-triangle intersection

kernel void
krnIntersectTriangle(Ray ray<>, Triangle tris[],
                     RayState oldraystate<>,
                     GridTrilist trilist[],
                     out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if(oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    pvec = cross(ray.d, edge2);
    det = dot(edge1, pvec);
    inv_det = 1.0f/det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot( tvec, pvec );
    qvec = cross( tvec, edge1 );
    candidatehit.data.z = dot( ray.d, qvec );
    candidatehit.data.x = dot( edge2, qvec );
    candidatehit.data.xyz *= inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0,0,0,-1);
  }
}

reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}

reductions
• reductions
– compute single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}

float a<100>;
float r;
sum(a,r);

equivalent C loop:

r = a[0];
for (int i=1; i<100; i++)
  r += a[i];

reductions
• reductions
– associative operations only
(a+b)+c = a+(b+c)
• sum, multiply, max, min, OR, AND, XOR
• matrix multiply
– permits parallel execution
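To illustrate why associativity permits parallel execution, here is a small C sketch that computes the same sum two ways: left to right, and as a pairwise tree whose two halves do not depend on each other. The function names are hypothetical. Integers are used deliberately, since floating-point addition is only approximately associative and the two groupings can differ in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential sum, matching the loop shown for the reduce kernel. */
int sum_sequential(const int *a, size_t n) {
    assert(n > 0);
    int r = a[0];
    for (size_t i = 1; i < n; i++)
        r += a[i];
    return r;
}

/* Pairwise (tree) sum: the two halves are independent, so they could be
 * evaluated in parallel. Regrouping is only valid because + is associative. */
int sum_tree(const int *a, size_t n) {
    assert(n > 0);
    if (n == 1)
        return a[0];
    size_t half = n / 2;
    return sum_tree(a, half) + sum_tree(a + half, n - half);
}
```

The tree form needs about log2(n) dependent steps instead of n − 1, which is what makes a reduction practical on data-parallel hardware.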

system outline

brcc
source to source compiler
– generate Cg & HLSL code
– cgc and fxc for shader assembly
– virtualization

brt
Brook run-time library
– stream texture management
– kernel shader execution

eliminating GPU limitations
treating texture as memory
– limited texture size and dimension
– compiler inserts address translation code

float matrix<8096,10,30,5>;
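The kind of address translation involved can be sketched in C as follows. This is not the code brcc actually emits: the row-major flattening, the texture width parameter, and all names are assumptions chosen to show how a 4D stream index might be mapped onto a 2D texture coordinate.

```c
#include <assert.h>
#include <stddef.h>

#define NDIMS 4

typedef struct { size_t x, y; } TexCoord;

/* Flatten a 4D stream index into one linear offset (row-major order is an
 * assumption of this sketch), then wrap that offset into rows of a 2D
 * texture of width tex_width. */
TexCoord translate_address(const size_t idx[NDIMS],
                           const size_t dims[NDIMS],
                           size_t tex_width) {
    assert(tex_width > 0);
    size_t linear = 0;
    for (int d = 0; d < NDIMS; d++) {
        assert(idx[d] < dims[d]);
        linear = linear * dims[d] + idx[d];
    }
    TexCoord tc = { linear % tex_width, linear / tex_width };
    return tc;
}
```

For the stream declared above and an assumed texture width of 2048, element (2,0,0,0) has linear offset 3000 and lands at texel (952, 1).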

eliminating GPU limitations
extending kernel outputs
– duplicate kernels, let cgc or fxc do dead code elimination
– better solution:
"Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware"
Tim Foley, Mike Houston, and Pat Hanrahan

"Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling"
Andrew T. Riffel, Aaron E. Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens

applications

• ray-tracer
• segmentation
• SAXPY
• SGEMV
• fft
• edge detect
• linear algebra

evaluation
compared against:
• Intel Math Library
• Atlas Math Library
• cached blocked segmentation
• FFTW
• Wald ['04] SSE Ray-Triangle

[Chart: relative performance of ATI Radeon X800 XT, NVIDIA GeForce 6800, and Pentium 4 3.0 GHz on SAXPY, Segment, SGEMV, FFT, and Ray-tracer]

evaluation
GPU wins when…
• limited data reuse
  ✓ SAXPY
  ✗ FFT

Pentium 4 3.0 GHz
44 GB/sec peak cache bandwidth
NVIDIA GeForce 6800 Ultra
36 GB/sec peak memory bandwidth

[Chart: relative performance on SAXPY and FFT]

evaluation
GPU wins when…
• arithmetic intensity
  ✓ Segment: 3.7 ops per word
  ✗ SGEMV: 1/3 ops per word

[Chart: relative performance on Segment and SGEMV]

outperforming the CPU
considering GPU transfer costs: Tr
– computational intensity: γ
γ ≡ Kgpu / Tr
work per word transferred

considering the CPU cost of issuing a kernel
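One way to make the comparison concrete, under an assumed linear cost model that the slide does not spell out: processing n elements costs n·(Tr + Kgpu) on the GPU and n·Kcpu on the CPU, where Kcpu is the CPU's per-element time (a symbol introduced here). Writing s = Kcpu / Kgpu, the GPU comes out ahead when γ > 1/(s − 1). The C sketch below encodes that test; the function names are hypothetical and the fixed cost of issuing a kernel is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* gamma = Kgpu / Tr: GPU work time per unit of transfer time. */
double computational_intensity(double k_gpu, double t_r) {
    assert(t_r > 0.0);
    return k_gpu / t_r;
}

/* Assumed model: per-element GPU cost is t_r + k_gpu, per-element CPU cost
 * is k_cpu. The GPU wins when t_r + k_gpu < k_cpu, which for
 * s = k_cpu / k_gpu > 1 is the same as gamma > 1 / (s - 1). */
bool gpu_wins(double k_gpu, double k_cpu, double t_r) {
    assert(k_gpu > 0.0 && k_cpu > 0.0 && t_r >= 0.0);
    return t_r + k_gpu < k_cpu;
}
```

For example, with Kgpu = 1, Kcpu = 4 (so s = 4) and Tr = 2, γ = 0.5 exceeds 1/(s − 1) ≈ 0.33 and the GPU wins; raising Tr to 3 gives γ = 1/3 and the advantage disappears.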

efficiency

Brook version within 80% of hand-coded GPU version

[Chart: FFT relative performance on ATI and Pentium 4; bars labeled Hand coded, Brook, Hand coded, C]

summary
• GPUs are faster than CPUs
– and getting faster
• why?
– data parallelism
– arithmetic intensity
• what is the right programming model?
– Brook
– stream computing

summary
GPU-based computing for the masses

bioinformatics
rendering
statistics
simulation

acknowledgements
• paper
– Bill Mark (UT-Austin)
– Nick Triantos, Tim Purcell (NVIDIA)
– Mark Segal (ATI)
– Kurt Akeley
– Reviewers
• language
– Stanford Merrimac Group
– Reservoir Labs
• sponsors
– DARPA contract MDA904-98-R-S855, F29601-00-2-0085
– DOE ASC contract LLL-B341491
– NVIDIA, ATI, IBM, Sony
– Rambus Stanford Graduate Fellowship
– Stanford School of Engineering Fellowship

Brook for GPUs
• release v0.3 available on Sourceforge
• project page
– http://graphics.stanford.edu/projects/brook
• source
– http://www.sourceforge.net/projects/brook
• over 6K downloads!
• interested in collaborating?

fly-fishing fly images from The English Fly Fishing Shop
