
Building a High Level Language

Compiler For GPGPU

Bixia Zheng ([email protected])


Derek Gladding ([email protected])
Micah Villmow ([email protected])

June 8th, 2008

Agenda

Time Topic

1:30PM – 2:30PM Introduction to General-Purpose


Computation Using GPU (GPGPU)

2:40PM – 3:50PM Brook+: The Programming Language


and Compiler Implementation

4:00PM – 5:00PM Matrix Multiplication – A Case Study

Folding@Home Demo

2 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

1
Introduction to GPGPU

3 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Outline

What is a GPU?
- Qualitative overview
- Tour of current AMD GPU architecture

How to use GPUs for general-purpose computation?


- Programming model considerations
- Tour of AMD’s Compute Abstraction Layer

4 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

2
CPU vs. GPU

CPU                                     GPU

Fast, agile                             Not so fast, not so nimble
but lower carrying capacity             but carries a lot at a time
= optimised for low latency             = optimised for throughput

5 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

What are GPUs good at?

- Large data sets

- Latency-tolerant computations

- Minimal control flow or recursion

- High locality

- High compute/IO ratio

6 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

3
How good are they?

Potential for significant speedup for data


parallel problems
– Basic: 5-10x
– Tuned: 20-100x or even more

Up to 2 orders of magnitude better in


several key metrics vs. current CPUs
– Memory bandwidth
– GFLOPS per Watt
– GFLOPS per $
7 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

A Concrete Example: AMD R670


Graphics

500 GF (single precision), 100 GF (double precision),


75 GB/sec, up to 2G RAM on-card,
Mainstream version suggested retail below $199
This is the previous generation of GPU technology
8 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

4
R670 Architecture (1)
- ~700 million transistors
- 75 GB/s memory bandwidth
- 256b DDR3/4 interface
- Targeted for handling thousands of
simultaneous lightweight threads
- Instruction cache and constant cache for
unlimited program size
- Scalar ALU implementation with 320
(64x5) independent stream processors
- 256 (64x4) basic units (FMAD)
- 64 enhanced transcendental units
(COS, LOG, EXP, RSQ, etc.)
- Directly supports int, unsigned int, float
and double
- Collection of graphics-specific units
surrounding a general-purpose compute
core

9 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (2) – Scalar


Elements

[Diagram: the two kinds of scalar elements]

Basic scalar element                    Extended scalar element
(fused multiply-add)                    (FMAD plus transcendental pipe)
256 of these                            64 of these

10 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

5
R670 Architecture (3) – Processing
Element

[Diagram: one processing element – GPRs feed an input mux into a row of scalar ALUs, whose results pass through an output mux; 64 of these]

11 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (4) - SIMD

[Diagram: one SIMD – processing elements 0..15 share a GPR file 256 entries high and feed a wide ALU; 4 of these]

12 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

6
R670 Architecture (5) – Compute
Data Path

[Diagram: compute data path – a fetch unit feeds SIMD0..SIMD3, which read from memory (100s of clocks away) and write to the output]

13 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (6) – Threads


and Wavefronts

Previous slides sketched out the data hierarchy, but what


about control?

Smallest unit of control is the thread.

These are grouped into wavefronts.


[Diagram: many threads are grouped into a wavefront, which is scheduled onto a SIMD]

14 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

7
R670 Architecture (7) –
Conditionals via Predication
Only one “real” PC per wavefront, but threads are
automatically predicated – for if-then-else constructs, the
hardware will execute both sides of the branch if required.

test?   0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0     ← predicate mask
foo     "then" side of branch (executed where the mask is 1)
bar     "else" side of branch (executed where the mask is 0)

The result is as if each thread were tested serially.

The hardware will even handle the recursive application of this.
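For example, a branch in a kernel (a minimal illustrative sketch, not taken from the tutorial) is handled this way: every thread in the wavefront steps through both sides, and the predicate mask selects which results each thread keeps.

    // Sketch: both sides of this branch occupy the whole wavefront.
    // Threads where (a > 0.0f) keep the results of the "then" side;
    // the remaining threads keep the results of the "else" side.
    kernel void clampSign(float a<>, out float c<>)
    {
        if (a > 0.0f)
            c = 1.0f;     // "then" side, run under the predicate mask
        else
            c = -1.0f;    // "else" side, run under the inverted mask
    }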


15 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (8) –


Recursion

Limited hardware support for recursion:

A 32-entry stack, used for both

- function calls

and

- predicated branches

16 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

8
R670 Architecture (9) –
Multithreading
Memory fetch latency is quite high, so R670 is heavily
multithreaded to compensate.
[Diagram: wavefronts 1–4 interleaved over time – while one wavefront waits on a fetch the others use the ALUs, keeping both fetch-unit utilisation and ALU utilisation high]

17 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (10) –


Thread Register Usage
Threads share a common GPR pool – max # of threads is
determined by per-thread register usage.
[Diagram: a pool of 256 physical registers P0–P255 shared by wavefronts 0–63 – each thread's logical registers L0–L3 map onto four physical registers]

Therefore, the more registers each thread uses, the fewer threads are available
for latency compensation.
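As a back-of-the-envelope illustration of that trade-off (following the simplified 256-register pool in the figure above; the real allocation granularity is not covered here):

    /* With a 256-entry register pool, 4 registers per thread leaves room for
       64 wavefront contexts, while 16 registers per thread leaves only 16. */
    unsigned wavefrontSlots(unsigned poolRegs, unsigned regsPerThread)
    {
        return poolRegs / regsPerThread;   /* integer division; regsPerThread > 0 */
    }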

18 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

9
R670 Architecture (11) - Numerics

Integers

Single- and Double- Precision IEEE-754 floating point,


except:

- round-to-even only

- no denorms

- QNAN but not SNAN

- transcendentals optimised for speed not precision


(high precision needs extra refinement iteration)

19 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

R670 Architecture (12) - Summary

- SPMD (or “SIMD-inside-MIMD”)


- But automatic predication hides a lot of SIMD issues

- 320 fused multiply-add, 64 transcendental units

- Numerics far better than previous GPU generations

- Optimised for bandwidth, not latency

- Massively multithreaded to provide latency


compensation
- But watch out for register resource limits
- Performance heavily dependent on using lots of threads – large
data sets needed

20 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

10
How to Program GPU for
General-Purpose Computation?

21 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

GPGPU Programming Model

Any successful programming model has to consider:


– ISA and architecture change every six to twelve months
– Machine is very wide (both data and instructions)
– Caches tend to be simpler than CPU caches
– High bandwidth, low latency
– Threading issues
– Control flow
– Tiny stack
– Huge register set
– Lots of special-case graphics-oriented features on the chip

22 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

11
Survey of GPGPU programming
approaches
Driver Model + Shader/Program/Kernel
– GLSL / HLSL / OpenGL / DirectX®

Unified Single Source for CPU/GPU


– Brook/CUDA
– Hides driver API from programmer

Embedded Runtime and Libraries


– Peakstream/Rapidmind
– Built on top of drivers and intermediate languages
– Microsoft Accelerator

23 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

AMD Compute Abstraction Layer (CAL)

Designed to provide a forward-compatible, cross-


platform interface to AMD GPUs

Aimed at tool vendors and programmers looking for low-


level access to the GPU

Extensions to CAL provide opportunities for device


specific optimization

24 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

12
Where does CAL fit in?

CAL takes this … … and turns it into this

25 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

CAL Computation Model

CAL effectively turns the GPU into a virtualised giant SIMD compute array
(but with arbitrary output writes).

[Diagram: a CAL kernel execution domain of threads T(0,0) … T(x-1,y-1); a scheduler maps them onto thread processors TP0 … TPn-1 running on the processing cores, which read from an InputBuffer and ConstBuffer and write to an OutputBuffer and GlobalBuffer]

26 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

13
Writing A CAL Application

Write a CAL kernel

Initialize CAL, open a connection to a CAL device

Compile and load the kernel

Allocate CAL memory

Prepare input data in CAL memory

Bind CAL memory to input and output buffers

Specify execution domain, execute the CAL kernel
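As a rough illustration only, the steps above map onto the CAL 1.x C API roughly as follows. This is an untested sketch: the entry-point names follow the Stream SDK headers, but exact signatures, format names and enum values may differ between SDK versions, and all error handling is omitted.

    #include "cal.h"
    #include "calcl.h"

    const CALchar *ilSource = "...";     /* the IL kernel text (step 1) */

    CALdevice     device;    CALcontext ctx;
    CALdeviceinfo info;
    CALobject     obj;       CALimage   image;
    CALmodule     module;    CALfunc    func;
    CALname       inName, outName;
    CALresource   inRes, outRes;
    CALmem        inMem, outMem;
    CALdomain     domain = { 0, 0, 64, 64 };   /* execution domain (x0, y0, w, h) */
    CALevent      event;
    void         *ptr;       CALuint    pitch;

    calInit();                                  /* initialize CAL */
    calDeviceGetInfo(&info, 0);
    calDeviceOpen(&device, 0);                  /* open a connection to device 0 */
    calCtxCreate(&ctx, device);

    /* Compile and load the kernel for the device's target */
    calclCompile(&obj, CAL_LANGUAGE_IL, ilSource, info.target);
    calclLink(&image, &obj, 1);
    calModuleLoad(&module, ctx, image);
    calModuleGetEntry(&func, ctx, module, "main");

    /* Allocate CAL memory (format name per the 1.x headers; assumed) */
    calResAllocLocal2D(&inRes,  device, 64, 64, CAL_FORMAT_FLOAT_1, 0);
    calResAllocLocal2D(&outRes, device, 64, 64, CAL_FORMAT_FLOAT_1, 0);

    /* Prepare input data: map, fill, unmap */
    calResMap(&ptr, &pitch, inRes, 0);
    /* ... write the input values through ptr ... */
    calResUnmap(inRes);

    /* Bind CAL memory to the kernel's input and output names */
    calCtxGetMem(&inMem,  ctx, inRes);
    calCtxGetMem(&outMem, ctx, outRes);
    calModuleGetName(&inName,  ctx, module, "i0");   /* input sampler i0 */
    calModuleGetName(&outName, ctx, module, "o0");   /* output buffer o0 */
    calCtxSetMem(ctx, inName,  inMem);
    calCtxSetMem(ctx, outName, outMem);

    /* Specify the execution domain, run the kernel, wait for completion */
    calCtxRunProgram(&event, ctx, func, &domain);
    while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
        ;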

27 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

CAL API
Device management
– Open/Close and manage multiple devices

Context Management

Memory management
– Access to local GPU memory and remote system memory
– Ability to read/write directly to remote system memory

Code generation and optimization


– Compile CAL Kernel to optimized ISA for specified HW
– Option for offline or online compilation

Program execution
– Module loading
– Input/output binding

=> Asynchronous behavior: Overlap CPU computation, GPU computation


and data transfers

28 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

14
CAL Kernel
The program that executes on the GPU
– Language
  • AMD IL
– Compiler
  • GPU JIT compiler
  • Optimized GPU ISA instructions
– Input/Output
  • CAL memory

Execution Domain
– A region in the OutputBuffer
  • (x0, y0), (w, h)
– The CAL kernel is invoked once for each element in the region
  • Output goes to that element
  • vPos gives the index of the element

29 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

AMD IL

The language in which CAL kernels are written

A portable intermediate language for AMD GPUs

Resembles DirectX® assembly

Instruction syntax
<opcode>[_<ctrl>][_<ctrl(val)>] <= opcode with specifiers

[<dst>[_<mod>][.<write-mask>]] <= dst with modifier/mask

[, <src>[_<mod>][.<swizzle-mask>]] <= src with modifier/mask

30 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

15
IL Instruction Opcodes
Declaration and initialization
– CAL Memory (input, constant buffer)
– Literals

Read/write CAL inputBuffer


– sample_resource(n)_sampler(n) dst, src

ALU instructions
– float, double, int
– Comparison, bit-wise, arithmetic, trigonometric

Type conversion instructions


– d2f, f2d, ftoi, itof, …

Flow control
– if, whileloop, call, …

Others
– mov, cmov, …

31 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

IL Instruction Operands
Memory “array”
– globalBuffer: g
– constBuffer: cb0, cb1, …

Special registers
– Implicitly write element in the outputBuffer: o0, o1, …
– Interpolated values: v0, v1, …

Virtual GPRs
– r0, r1, …

Literal constants
– l0, l1, …

Four components, typeless


– Write masks specify the components to write
– Swizzles specify the components to use

mov r1.x_z_, r2.y0w0

32 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

16
CAL Summary

CAL presents a clean, compute-centric abstraction of a GPU:

– Complete set of device management tools provided


• Includes multi-device support

– Communication primitives for moving data to/from the host

– Threads are structured as a 2D array


• (but can do arbitrary memory access)

– Programs written in high-level, portable assembler


• (but you can work down on the bare metal if you want)

33 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Agenda

Time Topic

1:30PM – 2:30PM Introduction to General-Purpose


Computation Using GPU (GPGPU)

2:40PM – 3:50PM Brook+: The Programming Language


and Compiler Implementation

4:00PM – 5:00PM Matrix Multiplication – A Case Study

Folding@Home Demo

34 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

17
Building Tools With CAL

Add a CAL backend to existing parallelizing compilers


– OpenMP, Fortran, auto-parallelization

Implement performance libraries with CAL


– AMD Core Math Library (ACML)

Implement an existing domain specific high level


language with CAL
– Brook+

35 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

BrookGPU and Brook+

BrookGPU
– An open source implementation of an ANSI C-like stream programming language on GPUs (Stanford University, [3])

Brook+
– BrookGPU with enhancements to enable AMD GPU features
  • CAL backend
  • More data types: int, double
  • Beyond the pure stream model: gather, scatter

36 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

18
Outline

The Brook+ Programming Language

The Brook+ Compiler Implementation


- Code Generator for CAL
- The Compiler and Runtime Interface

Summary

37 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

The Brook+ Language

38 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

19
Brook+ Language Constructs For
Stream Programming
Stream
– A collection of data to be operated on in parallel

Kernel
– A function that operates on stream elements

Intrinsic functions for copying data


– streamRead: regular array => stream
– streamWrite: stream => regular array

39 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

A Simple Brook+ Program

#include <stdio.h>

kernel void sum (float a<>, float b<>, out float c<>)   // kernel
{
    c = a + b;
}

int main()
{
    float data[4] = {1.0, 2.0, 3.0, 4.0};
    float a<4>, c<4>;              // streams
    int i;

    streamRead(a, data);           // streamRead/streamWrite: intrinsic
    sum(a, a, c);                  //   functions for copying data
    streamWrite(c, data);

    for (i = 0; i < 4; i++)
        printf("%f ", data[i]);
    printf("\n");
    return 0;
}
40 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

20
Defining a Stream Variable

Examples
float a<4>;
int b<10>;
double c<20>;

Syntax
elementType variableName<streamShape>;

41 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Stream Element Types

Basic types
– float
– int (AMD extension)
– double (AMD extension)

Short vector types (components: x, y, z, w)


– float2, float3, float4
– int2, int3, int4 (AMD extension)
– double2 (AMD extension)

Structs aggregating the above "scalar" types

42 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

21
Stream Shapes

Up to four dimensions (see the example below)
– 1D: <n1>
– 2D: <n2, n1>
– 3D: <n3, n2, n1>
– 4D: <n4, n3, n2, n1>

Element indexing similar to C array indexing

Limits on stream shapes are implementation-defined
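For example (a small sketch, not from the slides), a 2D stream and the C array it shadows:

    float img<240, 320>;       // 2D stream: 240 rows by 320 columns
    float data[240][320];      // matching C array

    streamRead(img, data);     // copy the array into the stream
    // ... run kernels over img ...
    streamWrite(img, data);    // copy the stream back into the array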

43 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Kernel

Examples
kernel void sum (…) {…}

Two kinds of kernels


• Map kernel
kernel void kernelName (…) {…}
• Reduction kernel
reduce [kernel] void kernelName (…) {…}

44 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

22
Kernel Parameters: Input Streams

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

A read only stream with implicit read pattern


– Declaration syntax: float a<>
– One element is implicitly read by each kernel invocation
– The input stream name is used to refer to this element

45 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Kernel Parameters: Output Streams

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

A write only stream with implicit write pattern


– Declaration Syntax: out float c<>
reduce float c<>
– One element is written by each kernel invocation
– The output stream name is used to refer to this element

46 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

23
Calling A Map Kernel

Invoke the kernel for each element in the output streams


– Simple case: the input streams and output streams have the
same shape

a 0 1 2 3

kernel void
sum (float a<>, float b<>, out float c<>) b 10 11 12 13

{
c = a + b;
}

c 10 12 14 16

47 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Reduction Kernel

Compute an output element from a group of input elements
(input stream => smaller output stream)
– The computation must be associative
  • Users must ensure this
– One input stream
– One output stream

The reduction kernel is designed to be simple
– Simplifies the compiler/runtime implementation
– Users separate the computation that produces the elements to be reduced from the reduction itself (see the sketch below)
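A minimal sketch of that separation (stream sizes are assumed; the reduce kernel matches the sumR example shown later in this section):

    // Step 1: a map kernel computes the per-element values to be reduced.
    kernel void square(float a<>, out float b<>)
    {
        b = a * a;
    }

    // Step 2: a reduction kernel combines them with an associative operation.
    reduce void sumR(float b<>, reduce float s<>)
    {
        s = s + b;
    }

    // Host side (assumed shapes): float a<16>, b<16>, s<1>;
    //                             square(a, b);  sumR(b, s);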

48 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

24
Calling a Reduction Kernel

Apply the kernel to n input elements to compute one output element,
where n = reductionFactor

reductionFactor * |output stream| = |input stream|

reduce void
sumR (float a<>, reduce float c<>)
{
    c = c + a;   // c += a;
}

a:  0  1  2  3  4  5  6  7
c:  6  22             (reductionFactor = 4)

Apply the above one-dimensional rule to each dimension.
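A host-side sketch of driving this kernel (declarations assumed): an 8-element input reduced into a 2-element output gives reductionFactor = 8 / 2 = 4.

    float a<8>, c<2>;
    float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float result[2];

    streamRead(a, data);
    sumR(a, c);                 // each output element reduces 4 consecutive inputs
    streamWrite(c, result);     // result == {6, 22}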

49 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Non-Stream Parameters

Constant parameter: a read-only scalar variable
– Declaration syntax: float b
– Limited to valid stream element types

Reduction kernel output can be a scalar value
– Declaration syntax: reduce float c
– Syntactic sugar for "always reduce to an output stream with one element"
– Limited to valid stream element types
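A small sketch combining both kinds of non-stream parameters (illustrative only, not from the slides):

    // k is a read-only constant parameter; s is a scalar reduction output.
    kernel void scale(float a<>, float k, out float c<>)
    {
        c = k * a;
    }

    reduce void sumAll(float a<>, reduce float s)
    {
        s = s + a;
    }

    // Host side (assumed): float a<16>, c<16>; float total;
    //                      scale(a, 2.0f, c);  sumAll(c, total);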

50 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

25
Random Access Stream Parameters

Gather stream: a read-only stream, read via array indexing syntax
– Declaration syntax: float a[]

Scatter stream: a write-only stream, written via array indexing syntax (AMD extension)
– Declaration syntax: out float4 c[]

=> Beyond the pure stream model
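A small sketch of both access patterns (illustrative only; it assumes the 2D indexof() and swizzle behaviour used in the matrix-multiplication case study later):

    // Gather: read A at an arbitrary 2D position (here, the transposed index).
    kernel void transposeK(float A[][], out float C<>)
    {
        float2 p = indexof(C).xy;
        C = A[p.yx];                 // gather with swapped coordinates
    }

    // Scatter: write to an arbitrary position of the output (AMD extension).
    kernel void scatterCopy(float a<>, float idx<>, out float c[])
    {
        c[idx] = a;                  // scatter via array indexing
    }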

51 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Kernel Computation

Restricted C code, with short vector types


– Variable
• Formal parameters
• Local variables (auto storage)
• Limited data types, no pointers
– Control flow
• No goto
• No recursive calls
• Callee sub-kernels must be in the same compilation unit
– Limited set of standard library routines

=> A subset of the Brook language for current GPU architecture

52 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

26
The Brook+ Implementation

53 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Overview
Compiler (brcc)
– Convert Brook+ program into C++ code
– Code generation for kernels
• CAL Backend: IL kernels, embedded in the C++ code

Runtime (brt)
– CAL runtime component
• Invoke CAL API to allocate memory, compile and execute IL kernels
– CPU runtime component
• Emulate the GPU kernel execution for debugging

Compiler-Runtime Interface
– A few C++ classes (brt.hpp, kerneldesc.hpp)
• brook::stream, brook::kernel, kernelDescriptor
– Compiler generates code
• Create class objects, set object properties
• Send message to objects
– Runtime implements the classes for each target platform

54 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

27
Compiling a Brook+ Program

[Diagram: compiling a Brook+ program – application.br goes through the Preprocessor and Parser to an AST; after a semantic check, the Kernel Compilers emit IL kernels and CPU emulation kernels while the CPU-Code Editor converts the AST to C++; the resulting .cpp output is compiled together with the runtime headers and runtime lib by a C++ compiler into application.exe]

55 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Components Inherited From BrookGPU

Modified CTool
– C parser
– AST + manipulation
– AST => C

CPU-Code editor
– Insert calls to generic runtime API

CPU emulation kernel generator

56 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

28
AMD Specific Components

IL code generator

CAL runtime component

Language extensions
– More data types: int, double
– Beyond pure stream model: Scatter

Other enhancements
– Semantic error reporting, …

57 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

IL Code Generator
Map kernel parameters to CAL Input/Output
– Output streams, input streams, gather streams, scatter
streams, non-stream parameters
– Generate meta data to describe the mapping information

Translate kernel computation to IL instructions
– Straightforward translation
– No optimization

Multi-pass technique
– Multiple IL kernels implement the functionality of one Brook+ kernel

=> The result is C++ code that constructs and sets up a kernelDescriptor object

58 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

29
Input Stream Implementation

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

Use CAL InputBuffer

Two methods to implement implicit read location


– vPos: the index for the input/output data for the current
thread
• sample_resource(0)_sampler(0) r0.x, vPos.xy //a
• sample_resource(1)_sampler(1) r1.x, vPos.xy //b
– Interpolated values (v0..v7)
• sample_resource(0)_sampler(0) r0.x, v0.xy //a
• sample_resource(1)_sampler(1) r1.x, v1.xy //b

59 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Output Stream Implementation

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

Use CAL OutputBuffer

Both the CAL OutputBuffer and the Brook output stream have implicit write semantics
– mov o0, r4.x

60 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

30
Translating Kernel Computation

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

sample_resource(0)_sampler(0) r0.x, v0.xy


sample_resource(1)_sampler(1) r1.x, v1.xy
mov r2.x, r0.x
mov r3.x, r1.x
call 0
mov r4.x, r5.x
mov o0, r4.x
ret

func 0
add r6.x, r2.x, r3.x
mov r7.x, r6.x
mov r5.x, r7.x
ret

61 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

More on Input/Output Stream


Implementation
HW limitation on OutputBuffer: o0..o7
– Multiple passes to support more than eight output streams
  • The compiler generates multiple IL kernels for a Brook+ kernel with more than eight output streams
    – Each performs the same computation but outputs a different portion of the result
    – Relies on the CAL compiler for dead code elimination
  • Each runtime pass runs one IL kernel to collect its partial output

HW limitation on InputBuffer: i0..i127
– BrookGPU implemented a "kernel split" technique
  • We didn't carry it over

62 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

31
Gather/Scatter Stream Implementation

Using CAL InputBuffer to implement Gather Streams


– IL instruction example: sample_resource(id)_sampler(id)
dst, index

Using CAL GlobalBuffer to implement Scatter Streams


– One GlobalBuffer: g
– IL instruction example: mov g[r1.x], r2

63 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Non-Stream Parameter Implementation

Two methods
√ Using ConstBuffer
• Good for small data set, no indirect access
• IL instruction example: mov dst, cb0[index]

– Using InputBuffer
• IL instruction example: sample_resource(id)_sampler(id) dst,
index

64 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

32
Reduction Kernel Implementation

Multiple passes
– The compiler generates (n-1) IL kernels, one for each reductionFactor (reductionFactor = 2 .. n)
  • n is currently hard-coded to 8
– The runtime executes multiple IL kernels to achieve the target reductionFactor

65 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Reduction Kernel Implementation


– An Example
reduce void
sumR (float a<>, reduce float c<>)
{
c = c + a; // c += a;
}

Target reductionFactor = 11:         0  1  2  3  4  5  6  7  8  9  10

First pass, reductionFactor = 2:     1  5  9  13  17  10

Second pass, reductionFactor = 6:    55

66 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

33
Stream Construction

brook::stream defines the interface for stream


construction

Compiler translates stream declaration to brook::stream


construction

float a<4>;    is translated to:

::brook::stream a(::brook::getStreamType((float *)0), 4, -1);

67 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Prepare And Execute A Kernel (1/2)


brook::kernel defines interface to
– Construct a kernel object
– Pass kernel parameter to runtime
• PushStream
• PushOutput
• PushGatherStream
• PushScatterStream
• PushConstant
– Execute a kernel
• Map
• Reduce

Compiler inserts calls to prepare and execute a kernel

68 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

34
Prepare And Execute A Kernel (2/2)

kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

void sum (brook::stream a,
          brook::stream b,
          brook::stream c)
{
    static ::brook::kernel __k(…);

    __k->PushStream(a);
    __k->PushStream(b);
    __k->PushOutput(c);

    __k->Map();
}

69 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Representing A Kernel
Implementation

KernelDescriptor defines the representation of kernel


implementation

KernelDescriptor
    implementations: vector<TechniqueDesc>

TechniqueDesc
    passes: vector<PassDesc>
    reductionFactor: int

PassDesc
    shader: string
    constants: vector<Input>
    samplers: vector<Input>
    outputs: vector<Input>
70 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

35
Putting It Together (1/3)

#include <stdio.h>
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}

int main()
{
float data[4] = {1.0, 2.0, 3.0, 4.0};
int i;
float a<4>; ::brook::stream a(::brook::getStreamType(( float *)0), 4,-1);
float c<4>; ::brook::stream c(::brook::getStreamType(( float *)0), 4,-1);

streamRead(a, data);
sum(a, a, c);
streamWrite(c, data);

for (i = 0; i < 4; i++)


printf("%f ", data[i]);
printf("\n");
return 0;
}

71 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Putting It Together (2/3)

(Brook+ source as on the previous slide.) For the CAL backend, the compiler embeds the generated IL kernel in a kernel descriptor:

static const kernel_desc __sum_cal_desc
    = kernel_desc().pass("
        sample_resource(0)_sampler(0) r0.x, v0.xy
        sample_resource(1)_sampler(1) r1.x, v1.xy
        mov r2.x, r0.x
        mov r3.x, r1.x
        call 0
        mov r4.x, r5.x
        mov o0, r4.x
        ret

        func 0
        add r6.x, r2.x, r3.x
        mov r7.x, r6.x
        mov r5.x, r7.x
        ret
        end
    ");
static const void *__sum_cal = &__sum_cal_desc;

72 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

36
Putting It Together (3/3)

(Brook+ source as on the previous slide.) The compiler also emits a host-side stub that constructs the kernel object and executes it:

void sum (brook::stream a,
          brook::stream b,
          brook::stream c)
{
    static const void *__sum_fp[] = {
        "cal", __sum_cal,
        "cpu", NULL,
        NULL,  NULL };

    static ::brook::kernel __k(__sum_fp);

    __k->PushStream(a);
    __k->PushStream(b);
    __k->PushOutput(c);
    __k->Map();
}

73 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Summary

Brook+ Language
– C with stream extension
– Data parallel programming

Brook+ Compiler
– IL code generator
• Direct translation, no optimization
• Implement multi-pass technique
– CPU Code Editor
• Generate code to construct streams and kernels, prepare
and execute kernels

74 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

37
Future Work

Optimization
– High-level optimization: beyond kernels
– Intra-kernel optimization

Enhancement on semantic checking and error


reporting

Compiler-runtime co-design to improve performance

75 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Agenda

Time Topic

1:30PM – 2:30PM Introduction to General-Purpose


Computation Using GPU (GPGPU)

2:40PM – 3:50PM Brook+: The Programming Language


and Compiler Implementation

4:00PM – 5:00PM Matrix Multiplication – A Case Study

Folding@Home Demo

76 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

38
Matrix Multiplication…
Breaking the 200GFlops Barrier

Ideal candidate for GPGPU speedup?

Data has good spatial locality

Fits well with 2D GPU cache

Benefits from high bandwidth

Assumptions to simplify the discussion:


– M=K=N
– Sizes are multiples of 16

ACML has a generic solution using some of these techniques

CAL Performance Counters and AMD GPU Shader Analyzer (GSA) for
performance analysis

77 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Outline

Naïve CPU Implementation

Brook+ Implementations
– Naïve implementation
– Optimized implementation

Optimizing IL Kernel
– Initial IL kernel
– Loop unrolling
– Breaking data dependence chain
– Improving cache locality
– Further improving cache locality

78 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

39
System Information

CPU/System Information:
– 1.8 GHz Dual-Core AMD Opteron™ Model 2210
– 4GB of RAM
– Microsoft® Windows® XP SP 2
– ATI Radeon™ HD 3870 GPU, 512MB

Software:
– CAL SDK v1.00.2 beta*
– Brook+ SDK v1.00 beta*
– AMD GSA*

* Reference[2]

79 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Naïve CPU
void matmultCPU(float* a, float* b, float* c, int m, int k, int n)
{
    for (int y = 0; y < m; y++) {
        for (int x = 0; x < n; x++) {
            float temp = 0;
            for (int z = 0; z < k; z++) {
                temp += a[y * k + z] * b[z * n + x];
            }
            c[y * n + x] = temp;
        }
    }
}

80 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

40
Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096). The naïve CPU version stays below about 0.25 GFlops.]

81 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Transforming to Brook+

void matmultCPU(float* a, float* b, float* c, int m, int k, int n)   // a, b, c: convert to streams
{
    for (int y = 0; y < m; y++)          // handled implicitly by Brook+
        for (int x = 0; x < n; x++)      // handled implicitly by Brook+
        {
            float temp = 0;

            for (int z = 0; z < k; z++)  // convert to a while loop
            {
                temp += a[y * k + z] * b[z * n + x];
            }

            c[y * n + x] = temp;
        }
}

82 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

41
Naïve Brook+
kernel void
simple_matmult(float Width, float A[][], float B[][], out float C<>)
{
    float2 vPos = indexof(C).xy;                        // position in the output matrix, i.e. (x, y)
    float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);  // coordinates of A & B
    float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);       // step by which index is incremented
    float accumulator = 0.0f;                           // accumulates intermediate results
    float i0 = Width;

    while (i0 > 0)
    {
        // A[M][K] * B[K][N]
        accumulator += A[index.zw] * B[index.xy];
        index += step;
        i0 = i0 - 1.0f;
    }

    // Write the result back to the buffer
    C = accumulator;
}
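A host-side sketch for driving this kernel (assumed, mirroring the sum example earlier; Width = M = K = N and the matrices are square):

    int Width = 1024;
    float A<Width, Width>, B<Width, Width>, C<Width, Width>;

    streamRead(A, a);                        // a, b, c: row-major float arrays on the CPU
    streamRead(B, b);
    simple_matmult((float)Width, A, B, C);   // Width passed as a constant parameter
    streamWrite(C, c);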

83 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Using AMD GSA

[Screenshot: AMD GPU Shader Analyzer output for the compiled kernel]

42
Naïve Statistics

•5 Registers

•7 ALU Instructions

•2 Gather Instructions

•48.57% ALU Utilization

•11 lines of Brook+ code

•Brook+ SDK: simple_matmult.exe

85 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing CPU Naïve and Brook+ Naïve (y-axis up to 10 GFlops).]

86 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

43
Block Matrix Multiplication

[Diagram: block matrix multiplication – a block of C is computed from A × B block by block]

Matrix Vectorization

[Diagram: four consecutive floats (X, Y, Z, W) of A packed into a single float4 element]
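A CPU-side sketch of the packing step implied by the diagram above (illustrative only; the actual SDK sample may lay the data out differently):

    /* Pack an N x N float matrix into an N x (N/4) array of float4 elements,
       so one gather returns four consecutive values of a row. */
    typedef struct { float x, y, z, w; } float4_t;

    void packRows(const float *src, float4_t *dst, int n)
    {
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col += 4) {
                float4_t v = { src[row*n + col],     src[row*n + col + 1],
                               src[row*n + col + 2], src[row*n + col + 3] };
                dst[row*(n/4) + col/4] = v;
            }
    }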

44
Multiple I/O Streams

[Diagram: the A matrix is split into eight input streams A1–A8, the B matrix into four input streams B1–B4, and the C matrix into eight output streams C1–C8]
89 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Optimized Brook+ Implementation


(1/2)
kernel void
optimized_matmult(float loopVar0,
                  float4 A1[][], float4 A2[][], float4 A3[][], float4 A4[][],
                  float4 A5[][], float4 A6[][], float4 A7[][], float4 A8[][],
                  float4 B1[][], float4 B2[][], float4 B3[][], float4 B4[][],
                  out float4 C1<>, out float4 C2<>, out float4 C3<>, out float4 C4<>,
                  out float4 C5<>, out float4 C6<>, out float4 C7<>, out float4 C8<>)
{
    float2 vPos = indexof(C1).xy;                                 // position in the output matrix, i.e. (x, y)
    float4 four210 = float4(4.0f, 2.0f, 1.0f, 0.0f);              // constant (4, 2, 1, 0)
    float4 index = float4(vPos.x, vPos.y, four210.w, four210.w);  // coordinates of A & B from which values are fetched

    // Declare and initialize the accumulators
    float4 accumulator1 = four210.wwww, accumulator2 = four210.wwww,
           accumulator3 = four210.wwww, accumulator4 = four210.wwww;
    float4 accumulator5 = four210.wwww, accumulator6 = four210.wwww,
           accumulator7 = four210.wwww, accumulator8 = four210.wwww;

    float i0 = loopVar0;

    // LOOP here (shown on the next slide)

    C1 = accumulator1;
    C2 = accumulator2;
    C3 = accumulator3;
    C4 = accumulator4;
    C5 = accumulator5;
    C6 = accumulator6;
    C7 = accumulator7;
    C8 = accumulator8;
}
90 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

45
Optimized Brook+ Implementation
(2/2)
while (i0 > 0.0f)
{
    // Fetch values from A
    float4 A11 = A1[index.wy]; float4 A22 = A2[index.wy]; float4 A33 = A3[index.wy]; float4 A44 = A4[index.wy];
    float4 A55 = A5[index.wy]; float4 A66 = A6[index.wy]; float4 A77 = A7[index.wy]; float4 A88 = A8[index.wy];

    // Fetch values from B
    float4 B11 = B1[index.xw]; float4 B22 = B2[index.xw]; float4 B33 = B3[index.xw]; float4 B44 = B4[index.xw];

    accumulator1 += A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;
    accumulator2 += A22.xxxx * B11.xyzw + A22.yyyy * B22.xyzw + A22.zzzz * B33.xyzw + A22.wwww * B44.xyzw;
    accumulator3 += A33.xxxx * B11.xyzw + A33.yyyy * B22.xyzw + A33.zzzz * B33.xyzw + A33.wwww * B44.xyzw;
    accumulator4 += A44.xxxx * B11.xyzw + A44.yyyy * B22.xyzw + A44.zzzz * B33.xyzw + A44.wwww * B44.xyzw;
    accumulator5 += A55.xxxx * B11.xyzw + A55.yyyy * B22.xyzw + A55.zzzz * B33.xyzw + A55.wwww * B44.xyzw;
    accumulator6 += A66.xxxx * B11.xyzw + A66.yyyy * B22.xyzw + A66.zzzz * B33.xyzw + A66.wwww * B44.xyzw;
    accumulator7 += A77.xxxx * B11.xyzw + A77.yyyy * B22.xyzw + A77.zzzz * B33.xyzw + A77.wwww * B44.xyzw;
    accumulator8 += A88.xxxx * B11.xyzw + A88.yyyy * B22.xyzw + A88.zzzz * B33.xyzw + A88.wwww * B44.xyzw;

    index += four210.wwwz;

    // Decrement the loop counter
    i0 = i0 - 1.0f;
}
91 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Optimized Statistics

•29 Registers

•12 Gather Instructions

•58 ALU Instructions

•89.31% ALU utilization

•37 lines of Brook+ code

•Brook+ SDK: optimized_matmult.exe

92 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

46
Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing CPU Naïve, Brook+ Naïve and Brook+ Opt (y-axis up to 120 GFlops).]

93 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Optimizing IL Kernel

94 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

47
IL Kernel - Setup

Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop

End Loop
Setup Output Streams
Write out results

95 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

IL Kernel - Loop
whileloop
    itof r11._y__, r11.x
    ge r11._y__, r11.y, r37.x                        (check loop bound)
    break_logicalnz r11.y

    sample_resource(8)_sampler(8)   r21, r0.xwxx     (B fetches)
    sample_resource(9)_sampler(9)   r22, r0.xwxx
    sample_resource(10)_sampler(10) r23, r0.xwxx
    sample_resource(11)_sampler(11) r24, r0.xwxx

    sample_resource(0)_sampler(0) r13, r0.wzww       (A fetches)
    sample_resource(1)_sampler(1) r14, r0.wzww
    sample_resource(2)_sampler(2) r15, r0.wzww
    sample_resource(3)_sampler(3) r16, r0.wzww
    sample_resource(4)_sampler(4) r17, r0.wzww
    sample_resource(5)_sampler(5) r18, r0.wzww
    sample_resource(6)_sampler(6) r19, r0.wzww
    sample_resource(7)_sampler(7) r20, r0.wzww

    mad r12, r13.y, r22, r3                          (accumulator 1)
    mad r12, r13.x, r21, r12
    mad r12, r13.z, r23, r12
    mad r3,  r13.w, r24, r12

    mad r12, r14.y, r22, r4                          (accumulator 2)
    mad r12, r14.x, r21, r12
    mad r12, r14.z, r23, r12
    mad r4,  r14.w, r24, r12

    mad r12, r15.y, r22, r5                          (accumulator 3)
    mad r12, r15.x, r21, r12
    mad r12, r15.z, r23, r12
    mad r5,  r15.w, r24, r12

    mad r12, r16.y, r22, r6                          (accumulator 4)
    mad r12, r16.x, r21, r12
    mad r12, r16.z, r23, r12
    mad r6,  r16.w, r24, r12

    mad r12, r17.y, r22, r7                          (accumulator 5)
    mad r12, r17.x, r21, r12
    mad r12, r17.z, r23, r12
    mad r7,  r17.w, r24, r12

    mad r12, r18.y, r22, r8                          (accumulator 6)
    mad r12, r18.x, r21, r12
    mad r12, r18.z, r23, r12
    mad r8,  r18.w, r24, r12

    mad r12, r19.y, r22, r9                          (accumulator 7)
    mad r12, r19.x, r21, r12
    mad r12, r19.z, r23, r12
    mad r9,  r19.w, r24, r12

    mad r12, r20.y, r22, r10                         (accumulator 8)
    mad r12, r20.x, r21, r12
    mad r12, r20.z, r23, r12
    mad r10, r20.w, r24, r12

    dcl_literal l1, 1, 1, 1, 1
    mov r0.___w, r2.w                                (increment position counter)
    add r2.___w, r2.x, r38.y
    iadd r11.x___, r11.x, l1                         (increment loop counter)
endloop
96 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

48
IL Kernel - Summary
Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop
Loop Iteration Check
Fetch B Stream Data
Fetch A Stream Data

Perform
sub-matrix
multiplication

Increment Loop Counter


Increment Position Counter
End Loop
Setup Output Streams
Write out results
97 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Version 1 - Statistics

• 26 Registers
• 12 Gather Instructions
• 49 ALU Instructions
• 85.31% ALU utilization

Matrix Size    Cache Hit %
16             97.605
32             97.104
64             97.427
128            98.694
256            98.837
512            98.591
1024           98.019
2048           97.486
4096           97.312

98 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

49
Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing Brook+ Naïve, Brook+ Opt and PLDIv1 (y-axis up to 140 GFlops).]

99 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Problems?
Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop
    Loop Iteration Check            ← break check every iteration
    Fetch B Stream Data
    Fetch A Stream Data             ← memory/ALU switching frequently
    Perform sub-matrix multiplication
    Increment Loop Counter          ← incremented every iteration
    Increment Position Counter      ← incremented every iteration
End Loop
Setup Output Streams
Write out results
100 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

50
Loop Unrolling
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
    Loop Iteration Check                    ← 1/4th the branches required
    Fetch B Stream Data 1
    Fetch A Stream Data 1
    Perform sub-matrix multiplication 1
    Increment Position Counter
    Fetch B Stream Data 2
    Fetch A Stream Data 2
    Perform sub-matrix multiplication 2
    Increment Position Counter
    Fetch B Stream Data 3
    Fetch A Stream Data 3
    Perform sub-matrix multiplication 3
    Increment Position Counter
    Fetch B Stream Data 4
    Fetch A Stream Data 4
    Perform sub-matrix multiplication 4
    Increment Position Counter
    Increment Loop Counter * 4              ← 1/4th the increments required
End Loop
Setup Output Streams
Write out results

The CAL compiler now has more information for optimizations and reordering.
101 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Version 2 - Statistics

• 27 Registers
• 48 Gather Instructions
• 135 ALU Instructions
• 87.85% ALU utilization

Matrix Size    Cache Hit %*
16             95.277
32             95.556
64             95.555
128            98.584
256            98.787
512            98.535
1024           97.934
2048           97.688
4096           97.614

* Yellow – No change, Green – Increase, Red – Decline

102 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

51
Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing Brook+ Naïve, Brook+ Opt, PLDIv1 and PLDIv2 (y-axis up to 180 GFlops).]

103 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Dependent instructions
[Same unrolled loop as on the previous slide, annotated: each group's fetches depend on the position-counter increment just before them, and the increments themselves form a dependence chain.]
104 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

52
Break Dependence Chain
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
Loop Iteration Check
Fetches are no Increment Position Counter 1
Increment Position Counter 2
longer held up Increment Position Counter 3 Does it
by increments Increment Position Counter 4
Fetch B Stream Data 1 really
Fetch A Stream Data 1
Perform work?
sub-matrix 1
multiplication
Fetch B Stream Data 2
Fetch A Stream Data 2
Perform
Stream fetches sub-matrix 2
and submatrix multiplication
Fetch B Stream Data 3
blocks are now Fetch A Stream Data 3
Perform
rearrangable sub-matrix 3
multiplication
Fetch B Stream Data 4
Fetch A Stream Data 4
Perform
sub-matrix 4
multiplication
Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
105 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Version 3 – Statistics

• 28 Registers
• 48 Gather Instructions
• 136 ALU Instructions
• 87.35% ALU utilization

Matrix Size    Cache Hit %*
16             95.278
32             95.555
64             96.763
128            98.571
256            98.799
512            98.782
1024           98.421
2048           97.981
4096           97.662

* Yellow – No change, Green – Increase, Red – Decline

106 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

53
Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing Brook+ Opt, PLDIv1, PLDIv2 and PLDIv3 (y-axis up to 180 GFlops).]

107 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

How the cache works

[Diagram: thread 1 requests elements (1,1) (1,2) (1,3) (1,4) (1,5); the first request misses and the later ones hit. Threads send requests to the cache, which returns 4 dwords at a time; on a miss the cache fetches 2 quads from the texture.]
108 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

54
Alternating Data Streams
[Same loop as on the previous slide. Each B fetch group is 4 data fetches and each A fetch group is 8 data fetches, and the loop alternates between them. Bad for the cache: data is thrown away before it is used.]

109 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Alternating Data Streams


[Reordered loop: B Stream Data 1 and 2 are fetched back to back (4×2 data fetches), then A Stream Data 1 and 2 are fetched and multiplied (8×2 data fetches); likewise for groups 3 and 4. Better for the cache: 8 quads of data are pulled each time.]

110 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

55
Version 4 - Statistics

• 32 Registers
• 48 Gather Instructions
• 136 ALU Instructions
• 87.35% ALU utilization

Matrix Size    Cache Hit %*
16             95.277
32             95.555
64             96.763
128            98.571
256            98.797
512            98.879
1024           98.641
2048           98.264
4096           97.798

* Yellow – No change, Green – Increase, Red – Decline

111 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing Brook+ Opt, PLDIv1, PLDIv2, PLDIv3 and PLDIv4 (y-axis up to 200 GFlops).]

112 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

56
Still Alternating Data Streams
[Same loop as PLDIv4: the B and A fetches are grouped in pairs (8 data fetches each), but the loop still switches between the B and A data streams.]

113 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Still Alternating Data Streams


[Final ordering: all four B fetch groups are hoisted to the top of the loop body (4×4 data fetches), followed by the four A fetch / sub-matrix multiplication groups (4×8 data fetches). Best cache utilization.]
114 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

57
Final Version - Statistics

• 38 Registers
• 48 Gather Instructions
• 136 ALU Instructions
• 87.35% ALU utilization
• CAL SDK: simple_matmult.exe

Matrix Size    Cache Hit %*
16             95.302
32             95.577
64             96.778
128            98.570
256            98.746
512            98.843
1024           98.928
2048           98.999
4096           99.00

* Yellow – No change, Green – Increase, Red – Decline

115 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Performance
[Chart: matrix multiplication performance, GFlops vs. matrix size (16–4096), comparing PLDIv1 through PLDIv5 (y-axis up to 250 GFlops).]

116 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

58
Summary

Loop unrolling isn't very beneficial by itself on large sizes, since it decreases the ALU:gather instruction ratio

Remove loop-dependent calculations by using more registers

Grouping data fetches from the same texture allows better cache locality

Use more registers to hide data fetch latency; ~192 registers per SIMD on the ATI Radeon HD 3870

An unknown loop size can be handled by a switch statement before the loop (see the sketch after this list)

Rely on the CAL compiler to handle register scheduling, allocation and various micro-architecture and hardware-specific optimizations

The CAL compiler is a JIT compiler; the Brook+ compiler needs to be enhanced with static optimizations
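A CPU-style sketch of the "switch before the loop" idea for an unknown trip count (illustrative only; body() is a hypothetical placeholder for one iteration's work, and the GPU kernels in this case study simply fix the unroll factor at 4 and assume sizes that are multiples of 16):

    void runAll(int count)
    {
        int i = 0;
        switch (count % 4) {           /* peel the remainder, falling through on purpose */
        case 3: body(i++); /* fall through */
        case 2: body(i++); /* fall through */
        case 1: body(i++); /* fall through */
        case 0: break;
        }
        for (; i < count; i += 4) {    /* main loop, unrolled by 4 */
            body(i); body(i + 1); body(i + 2); body(i + 3);
        }
    }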

117 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Tutorial Summary

Modern GPUs provide tremendous computation power

CAL provides APIs and a virtual instruction set to program AMD


GPUs

Brook+
– An approach to implementing the Brook stream programming language using CAL
– Application performance relies heavily on compiler optimization, which is not available in the current implementation

118 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

59
References

[1] Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses, Tarditi et al., ASPLOS 2006

[2] AMD Stream Computing SDK, http://ati.amd.com/technology/streamcomputing/

[3] Brook for GPUs: Stream Computing on Graphics Hardware, Buck et al., SIGGRAPH 2004

[4] RapidMind, http://www.rapidmind.net/

119 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

Trademark Attribution

AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon, and combinations thereof are trademarks of Advanced
Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United
States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective
owners.

©2008 Advanced Micro Devices, Inc. All rights reserved.

120 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial

60
