Brook for GPUs:
Stream Computing on Graphics Hardware

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan

Computer Science Department
Stanford University

recent trends
[Figure: multiplies per second (GFLOPS), July 01 to Jan 04; series: NVIDIA NV30, 35, 40; ATI R300, 360, 420; Pentium 4]

SIGGRAPH 2004 2

recent trends
[Figure: GPU-based SIGGRAPH/Graphics Hardware papers, July 01 to Jan 04]

domain specific solutions
map directly to graphics primitives
requires extensive knowledge of GPU programming

building an abstraction
general GPU computing question
– can we simplify GPU programming?
– what is the correct abstraction for GPU-based computing?
– what is the scope of problems that can be implemented efficiently on the GPU?

contributions
• Brook stream programming environment for GPU-based computing
– language, compiler, and runtime system
• virtualizing or extending GPU resources
• analysis of when GPUs outperform CPUs

GPU programming model
each fragment shaded independently
– no dependencies between fragments
• temporary registers are zeroed
• no static variables
• no read-modify-write textures
– multiple “pixel pipes”

GPU = data parallel
data parallelism
– support ALU heavy architectures
– hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

compute vs. bandwidth
[Figure: GFLOPS vs. GFloats/sec on ATI hardware (R300, R360, R420); 7x gap]

compute vs. bandwidth
arithmetic intensity = compute-to-bandwidth ratio

graphics pipeline
– vertex
• BW: 1 vertex = 32 bytes
• OP: 100-500 f32-ops / vertex
– fragment
• BW: 1 fragment = 10 bytes
• OP: 300-1000 i8-ops / fragment
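
To make the ratio concrete, the per-element figures above can simply be divided out. The C sketch below is only an illustration; the helper names `arithmetic_intensity` and `print_pipeline_intensity` are assumptions made for this example and are not part of Brook.

```c
#include <stdio.h>

/* Arithmetic intensity as operations per byte of bandwidth.
   Hypothetical helper for illustration; not part of Brook. */
double arithmetic_intensity(double ops, double bytes) {
    return ops / bytes;
}

/* Prints the ranges implied by the figures quoted for the
   vertex and fragment stages of the graphics pipeline. */
void print_pipeline_intensity(void) {
    printf("vertex:   %.1f to %.1f ops/byte\n",
           arithmetic_intensity(100.0, 32.0),
           arithmetic_intensity(500.0, 32.0));
    printf("fragment: %.1f to %.1f ops/byte\n",
           arithmetic_intensity(300.0, 10.0),
           arithmetic_intensity(1000.0, 10.0));
}
```

With the quoted numbers this works out to roughly 3 to 16 ops per byte for vertices and 30 to 100 ops per byte for fragments. The two stages count different operation types (f32 vs. i8), so the figures indicate scale rather than a like-for-like comparison.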
Brook language
stream programming model
– enforce data parallel computing
• streams
– encourage arithmetic intensity
• kernels

design goals
• general purpose computing
GPU = general streaming-coprocessor
• GPU-based computing for the masses
no graphics experience required
eliminating annoying GPU limitations
• performance
• platform independent
ATI & NVIDIA
DirectX & OpenGL
Windows & Linux

Other languages
• Cg / HLSL / OpenGL Shading Language
+ C-like language for expressing shader computation
– graphics execution model
– requires graphics API for data management and shader execution
• Sh [McCool et al. '04]
+ functional approach for specifying shaders
• evolved from a shading language
• Connection Machine C*
• StreamIt, StreamC & KernelC, Ptolemy

Brook language
C with streams
• streams
– collection of records requiring similar computation
• particle positions, voxels, FEM cell, …

Ray r<200>;
float3 velocityfield<100,100,100>;

– data parallelism
• provides data to operate on in parallel

kernels
• kernels
– functions applied to streams
• similar to for_all construct
• no dependencies between stream elements

kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);

equivalent C loop:
for (i=0; i<100; i++)
  c[i] = a[i]+b[i];
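
The "no dependencies between stream elements" rule can be sketched in plain C by separating the per-element body from the loop that applies it. This is a CPU stand-in written for illustration; `foo_element` and `run_foo` are hypothetical names, not Brook runtime functions.

```c
#include <stddef.h>

/* Body of the kernel `foo`: it sees exactly one element of each
   input stream and produces one output element. It has no access
   to neighbouring elements or to state from earlier calls. */
static float foo_element(float a, float b) {
    return a + b;
}

/* Hypothetical CPU stand-in for applying the kernel to whole streams.
   Because foo_element is independent per element, the iterations
   could run in any order, or all at once on parallel hardware. */
void run_foo(const float *a, const float *b, float *result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        result[i] = foo_element(a[i], b[i]);
    }
}
```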

kernels
• kernel arguments
– input/output streams

kernel void foo (float a<>,
                 float b<>,
                 out float result<>) {
  result = a + b;
}

kernels
• kernel arguments
– gather streams

kernel void foo (..., float array[] ) {
  a = array[i];
}
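
A gather argument lets each element read the array at a computed position. The C sketch below emulates that on the CPU, assuming the index arrives through an input stream; `run_gather` and its parameters are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical CPU emulation of a gather kernel: each output element
   reads `array` at an index taken from the input stream `index`.
   The array is read-only inside the kernel; nothing is scattered
   back into it. Every index must be a valid position in `array`. */
void run_gather(const size_t *index, const float *array,
                float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = array[index[i]];
    }
}
```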

kernels
• kernel arguments
– iterator streams

kernel void foo (..., iter float n<> ) {
  a = n + b;
}
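
One way to picture an iterator stream is as a sequence whose elements are generated rather than stored. The C sketch below assumes a linear sequence defined by a start value and a step; that choice, and the names `iter_value` and `run_add_iter`, are assumptions for illustration and not a description of Brook's implementation.

```c
#include <stddef.h>

/* Hypothetical emulation: element i of the iterator stream is
   start + i * step, computed on the fly instead of read from memory. */
static float iter_value(float start, float step, size_t i) {
    return start + (float)i * step;
}

/* Kernel body from the slide, a = n + b, applied to every element,
   with n supplied by the generated sequence. */
void run_add_iter(float start, float step, const float *b,
                  float *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = iter_value(start, step, i) + b[i];
    }
}
```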

kernels
• kernel arguments
– constant parameters

kernel void foo (..., float c ) {
  a = c + b;
}
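
SAXPY, one of the applications evaluated later in the talk, is a natural example of a constant parameter: the scalar is the same for every element while the vectors are streamed. The C sketch below is a CPU illustration only; `run_saxpy` is a hypothetical name.

```c
#include <stddef.h>

/* result[i] = alpha * x[i] + y[i].
   alpha plays the role of a constant kernel parameter: one value
   shared by all elements. x and y play the role of input streams. */
void run_saxpy(float alpha, const float *x, const float *y,
               float *result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        result[i] = alpha * x[i] + y[i];
    }
}
```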

kernels
why not allow direct array operators?
A+B*C
– arithmetic intensity
• temporaries kept local to computation
– explicit communication
• kernel arguments

Ray-triangle intersection

kernel void
krnIntersectTriangle(Ray ray<>, Triangle tris[],
                     RayState oldraystate<>,
                     GridTrilist trilist[],
                     out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if(oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    pvec = cross(ray.d, edge2);
    det = dot(edge1, pvec);
    inv_det = 1.0f/det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot( tvec, pvec );
    qvec = cross( tvec, edge1 );
    candidatehit.data.z = dot( ray.d, qvec );
    candidatehit.data.x = dot( edge2, qvec );
    candidatehit.data.xyz *= inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0,0,0,-1);
  }
}

reductions
• reductions
– compute single value from a stream

reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}

float a<100>;
float r;
sum(a,r);

equivalent C loop:
r = a[0];
for (int i=1; i<100; i++)
  r += a[i];

reductions
• reductions
– associative operations only
(a+b)+c = a+(b+c)
• sum, multiply, max, min, OR, AND, XOR
• matrix multiply
– permits parallel execution
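
Associativity is what allows the additions to be regrouped into pairs. The C sketch below shows one such regrouping, a pairwise (tree) sum; it is a serial illustration of the idea, not Brook's reduction implementation, and `tree_sum` is a hypothetical name.

```c
#include <stddef.h>

/* Pairwise (tree) sum of buf[0..n-1], for n >= 1.
   Overwrites buf and returns the total.
   Each pass halves the number of live elements. The additions within
   one pass do not depend on each other, which is what a parallel
   machine would exploit. This serial loop can safely reuse buf; on a
   GPU each pass would write to a separate output buffer instead. */
float tree_sum(float *buf, size_t n) {
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++) {
            buf[i] = buf[2 * i] + buf[2 * i + 1];
        }
        if (n % 2 == 1) {
            /* Odd count: carry the unpaired last element forward. */
            buf[half] = buf[n - 1];
            half++;
        }
        n = half;
    }
    return buf[0];
}
```

For floating-point data the regrouping can change rounding slightly compared with the left-to-right loop, since floating-point addition is only approximately associative.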

system outline

brcc
source to source compiler
– generate Cg & HLSL code
– cgc and fxc for shader assembly
– virtualization

brt
Brook run-time library
– stream texture management
– kernel shader execution

eliminating GPU limitations
treating texture as memory
– limited texture size and dimension
– compiler inserts address translation code

float matrix<8096,10,30,5>;
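
The kind of address translation involved can be sketched as flattening a 4D index to a linear offset and then folding that offset into a 2D texture. The slides do not give the compiler's actual scheme; the row-major layout, the `tex_width` parameter, and the names `linearize4`, `to_texcoord` and `TexCoord` below are all assumptions for illustration.

```c
#include <stddef.h>

/* A 2D texel position. Hypothetical type for this sketch. */
typedef struct {
    size_t x;
    size_t y;
} TexCoord;

/* Flatten a 4D index into a linear offset, row-major
   (idx[0] varies slowest, idx[3] fastest). */
size_t linearize4(const size_t dims[4], const size_t idx[4]) {
    size_t offset = 0;
    for (int d = 0; d < 4; d++) {
        offset = offset * dims[d] + idx[d];
    }
    return offset;
}

/* Fold a linear offset into a 2D texture that is tex_width texels wide. */
TexCoord to_texcoord(size_t offset, size_t tex_width) {
    TexCoord t;
    t.x = offset % tex_width;
    t.y = offset / tex_width;
    return t;
}
```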

eliminating GPU limitations
extending kernel outputs
– duplicate kernels, let cgc or fxc do dead code elimination
– better solution:
"Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware"
Tim Foley, Mike Houston, and Pat Hanrahan

"Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling"
Andrew T. Riffel, Aaron E. Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens
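
The "duplicate kernels" approach can be pictured in C as running the full kernel body once per output and keeping a different result each time. The example kernel and the names `kernel_both`, `pass_sum` and `pass_prod` are hypothetical; the point is only that the unused output in each copy becomes dead code a compiler can remove.

```c
/* A kernel body with two outputs (hypothetical example). */
static void kernel_both(float a, float b, float *sum, float *prod) {
    *sum = a + b;
    *prod = a * b;
}

/* Copy 1: same body, but only `sum` is kept. The computation of
   `prod` is dead here and can be eliminated by the compiler. */
float pass_sum(float a, float b) {
    float sum, prod;
    kernel_both(a, b, &sum, &prod);
    return sum;
}

/* Copy 2: same body, but only `prod` is kept. */
float pass_prod(float a, float b) {
    float sum, prod;
    kernel_both(a, b, &sum, &prod);
    return prod;
}
```

The cost of this scheme is that any work shared by both outputs is repeated in every copy, which is the inefficiency the partitioning papers cited above set out to reduce.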

applications
[Figure: application images: ray-tracer, segmentation, SAXPY, SGEMV, fft, edge detect, linear algebra]

evaluation
compared against:
• Intel Math Library
• Atlas Math Library
• cached blocked segmentation
• FFTW
• Wald ['04] SSE Ray-Triangle

[Figure: relative performance for SAXPY, Segment, SGEMV, FFT, Ray-tracer; series: ATI Radeon X800 XT, NVIDIA GeForce 6800, Pentium 4 3.0 GHz]

evaluation
GPU wins when…
• limited data reuse
✓ SAXPY
✗ FFT

Pentium 4 3.0 GHz: 44 GB/sec peak cache bandwidth
NVIDIA GeForce 6800 Ultra: 36 GB/sec peak memory bandwidth

[Figure: relative performance for SAXPY and FFT]

evaluation
GPU wins when…
• arithmetic intensity
✓ Segment: 3.7 ops per word
✗ SGEMV: 1/3 ops per word

[Figure: relative performance for Segment and SGEMV]

outperforming the CPU
considering GPU transfer costs: T_r
– computational intensity: γ
γ ≡ K_gpu / T_r
work per word transferred

considering CPU cost of issuing a kernel
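
A simple per-element cost model makes the role of γ concrete. The sketch below assumes the GPU pays the transfer cost plus the kernel cost per element while the CPU pays only its own kernel cost; the CPU cost `k_cpu` does not appear on the slide and is introduced here as an assumption, as are the function names. Kernel launch overhead is ignored.

```c
/* Computational intensity: GPU kernel time per element divided by
   transfer time per element (gamma in the slide's notation). */
double computational_intensity(double k_gpu, double t_r) {
    return k_gpu / t_r;
}

/* Returns 1 if the GPU is expected to win under the assumed model
   T_r + K_gpu < K_cpu, and 0 otherwise. k_cpu is a hypothetical
   per-element CPU cost, not a quantity defined on the slide. */
int gpu_wins(double k_cpu, double k_gpu, double t_r) {
    return (t_r + k_gpu) < k_cpu;
}
```

Under this model, writing s = K_cpu / K_gpu for the raw kernel speedup and taking s > 1, the condition rearranges to γ > 1 / (s − 1): the smaller the GPU's speedup on the kernel itself, the more work per transferred word is needed before the transfer pays off.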

efficiency
Brook version within 80% of hand-coded GPU version

[Figure: FFT relative performance; groups: ATI, Pentium 4; bars: Hand coded, Brook, Hand coded, C]

summary
• GPUs are faster than CPUs
– and getting faster
• why?
– data parallelism
– arithmetic intensity
• what is the right programming model?
– Brook
– stream computing

summary
GPU-based computing for the masses
[Figure: application areas: bioinformatics, rendering, statistics, simulation]

acknowledgements
• paper
– Bill Mark (UT-Austin)
– Nick Triantos, Tim Purcell (NVIDIA)
– Mark Segal (ATI)
– Kurt Akeley
– Reviewers
• language
– Stanford Merrimac Group
– Reservoir Labs
• sponsors
– DARPA contract MDA904-98-R-S855, F29601-00-2-0085
– DOE ASC contract LLL-B341491
– NVIDIA, ATI, IBM, Sony
– Rambus Stanford Graduate Fellowship
– Stanford School of Engineering Fellowship

Brook for GPUs
• release v0.3 available on Sourceforge
• project page
– http://graphics.stanford.edu/projects/brook
• source
– http://www.sourceforge.net/projects/brook
• over 6K downloads!
• interested in collaborating?

fly-fishing fly images from The English Fly Fishing Shop
