PLDI08Tutorial
Agenda
Time Topic
Folding@Home Demo
2 June 8, 2008 Building a High Level Language Compiler for GPGPU PLDI Tutorial
Introduction to GPGPU
Outline
What is a GPU?
- Qualitative overview
- Tour of current AMD GPU architecture
CPU vs. GPU
[Table: CPU vs. GPU comparison; cell contents not recovered]
- Latency-tolerant computations
- High locality
How good are they?
R670 Architecture (1)
- ~700 million transistors
- 75 GB/s memory bandwidth
- 256b DDR3/4 interface
- Targeted for handling thousands of
simultaneous lightweight threads
- Instruction cache and constant cache for
unlimited program size
- Scalar ALU implementation with 320
(64x5) independent stream processors
- 256 (64x4) basic units (FMAD)
- 64 enhanced transcendental units
(COS, LOG, EXP, RSQ, etc.)
- Directly supports int, unsigned int, float
and double
- Collection of graphics-specific units
surrounding a general-purpose compute
core
[Figure: multiplier (x) and adder (+) units, transcendental pipe]
R670 Architecture (3) – Processing Element
[Figure: GPRs, Input Mux, ALU, Output Mux; 64 of these]
[Figure: lanes 0, 1, ... 15; GPRs (256 high); ALU; 4 of these]
R670 Architecture (5) – Compute Data Path
[Figure: Fetch, Memory, SIMD0 to SIMD3, Output; annotation: 100s of clocks]
[Figure labels: Thread (repeated 15 times), SIMD, Wavefront]
R670 Architecture (7) – Conditionals via Predication
Only one “real” PC per wavefront, but threads are
automatically predicated – for if-then-else constructs, the
hardware will execute both sides of the branch if required.
- function calls
and
- predicating branches
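The predication behaviour described above can be sketched on the CPU. This is only an illustration under assumptions: the function name, the example branch (`if (x < 0) y = -x; else y = 2 * x;`) and the `sidesExecuted` counter are hypothetical and do not describe the actual hardware sequencer. It shows the one point the slide makes: a wavefront walks through each side of the branch that at least one of its threads needs, and a per-thread predicate decides which results are kept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct WavefrontResult {
    std::vector<float> y;
    int sidesExecuted;  // 1 if all lanes agree on the branch, 2 if they diverge
};

// Emulates one wavefront running `if (x < 0) y = -x; else y = 2 * x;`
// with a single program counter shared by all lanes.
WavefrontResult runIfElse(const std::vector<float>& x) {
    assert(!x.empty());  // an empty wavefront is not meaningful in this model
    std::vector<bool> pred(x.size());
    bool anyThen = false, anyElse = false;
    for (std::size_t lane = 0; lane < x.size(); ++lane) {
        pred[lane] = x[lane] < 0.0f;  // predicate is evaluated per lane
        if (pred[lane]) anyThen = true; else anyElse = true;
    }
    WavefrontResult r{std::vector<float>(x.size()), 0};
    if (anyThen) {  // "then" side runs only if some lane needs it
        ++r.sidesExecuted;
        for (std::size_t lane = 0; lane < x.size(); ++lane)
            if (pred[lane]) r.y[lane] = -x[lane];
    }
    if (anyElse) {  // "else" side runs only if some lane needs it
        ++r.sidesExecuted;
        for (std::size_t lane = 0; lane < x.size(); ++lane)
            if (!pred[lane]) r.y[lane] = 2.0f * x[lane];
    }
    return r;
}
```

A wavefront whose inputs have mixed signs pays for both sides, while one whose inputs are all non-negative executes only the else side.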
R670 Architecture (9) – Multithreading
Memory fetch latency is quite high, so R670 is heavily
multithreaded to compensate.
[Figure: Wavefronts 1 to 4; fetch unit utilisation and ALU utilisation]
[Figure: register allocation; Wavefront 63: L0 P252, L1 P253, L2 P254, L3 P255]
Therefore, the more registers each thread uses, the fewer threads are available
for latency compensation.
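The trade-off can be made concrete with a little arithmetic. The sketch below assumes a register file 256 GPRs deep (the "256 high" figure from the SIMD diagram) that is split evenly among resident wavefronts. The function name and the even-split model are illustrative assumptions, not a description of the real allocator.

```cpp
#include <cassert>

// Hypothetical model: a register file `gprDepth` entries deep is divided
// evenly among wavefronts that each need `regsPerThread` GPRs per thread.
int maxResidentWavefronts(int gprDepth, int regsPerThread) {
    assert(gprDepth > 0 && regsPerThread > 0);
    return gprDepth / regsPerThread;  // integer division: a partial wavefront does not fit
}
```

Under this model, 4 registers per thread allow 64 wavefronts (numbered 0 to 63, consistent with the diagram), while a kernel that needs 32 registers per thread leaves room for only 8.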
R670 Architecture (11) - Numerics
Integers
- round-to-even only
- no denorms
How to Program GPU for General-Purpose Computation?
Survey of GPGPU programming approaches
Driver Model + Shader/Program/Kernel
– GLSL / HLSL / OpenGL / DirectX®
Where does CAL fit in?
T(0,y-1) T(1,y-1) ... T(x-1,y-1)
Scheduler
ConstBuffer GlobalBuffer
Processing Cores
Writing A CAL Application
CAL API
Device management
– Open/Close and manage multiple devices
Context Management
Memory management
– Access to local GPU memory and remote system memory
– Ability to read/write directly to remote system memory
Program execution
– Module loading
– Input/output binding
CAL Kernel
The program executes on the GPU
– Language
• AMD IL
– Compiler
• GPU JIT compiler
• Optimized GPU ISA instructions
– Input/Output
CAL Memory
Execution Domain
– A region in the OutBuffer
• (x0, y0) (w, h)
– CAL Kernel is invoked once for each element in the region
• Output to the element
• vPos represents the index of the element
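The execution-domain rule can be emulated on the CPU as a pair of nested loops. The sketch below is an assumption-laden illustration: `runDomain`, its parameters, and the row-major buffer layout are hypothetical and are not part of the CAL API. It shows the kernel being invoked once per element of the region (x0, y0) (w, h), receiving the element position in the role of vPos, and writing its result to that same element.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical CPU emulation of a CAL execution domain over a row-major
// output buffer that is `bufferWidth` elements wide.
void runDomain(std::vector<float>& outBuffer, int bufferWidth,
               int x0, int y0, int w, int h,
               const std::function<float(int, int)>& kernel) {
    assert(bufferWidth > 0 && x0 >= 0 && y0 >= 0 && w >= 0 && h >= 0);
    assert(x0 + w <= bufferWidth);
    assert((y0 + h) * bufferWidth <= static_cast<int>(outBuffer.size()));
    for (int y = y0; y < y0 + h; ++y) {
        for (int x = x0; x < x0 + w; ++x) {
            // One kernel invocation per element; (x, y) plays the role of vPos.
            outBuffer[y * bufferWidth + x] = kernel(x, y);
        }
    }
}
```

Elements outside the region are left untouched, which is why a domain smaller than the buffer updates only part of the output.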
AMD IL
Instruction syntax
<opcode>[_<ctrl>][_<ctrl(val)>] <= opcode with specifiers
IL Instruction Opcodes
Declaration and initialization
– CAL Memory (input, constant buffer)
– Literals
ALU instructions
– float, double, int
– Comparison, bit-wise, arithmetic, trigonometric
Flow control
– if, whileloop, call, …
Others
– mov, cmov, …
IL Instruction Operands
Memory “array”
– globalBuffer: g
– constBuffer: cb0, cb1, …
Special registers
– Implicitly write element in the outputBuffer: o0, o1, …
– Interpolated values: v0, v1, …
Virtual GPRs
– r0, r1, …
Literal constants
– l0, l1, …
CAL Summary
Building Tools With CAL
BrookGPU
– An open source implementation of an ANSI C-like stream
programming language on GPU (Stanford University, [3])
Brook+
– BrookGPU with enhancements to enable AMD GPU features
• CAL backend
• More data types: int, double
• Beyond pure stream model: gather, scatter
Outline
Summary
Brook+ Language Constructs For Stream Programming
Stream
– A collection of data to be operated on in parallel
Kernel
– A function that operates on stream elements
#include <stdio.h>
kernel void sum (float a<>, float b<>, out float c<>)   // Kernel
{
    c = a + b;
}
int main()
{
    float data[4] = {1.0, 2.0, 3.0, 4.0};
    float a<4>, c<4>;        // Streams
    int i;
    streamRead(a, data);     // Intrinsic functions for copying data
    sum(a, a, c);
    streamWrite(c, data);    // Intrinsic functions for copying data
    for (i = 0; i < 4; i++)
        printf("%f ", data[i]);
    printf("\n");
    return 0;
}
Defining a Stream Variable
Examples
float a<4>;
int b<10>;
double c<20>;
Syntax
elementType variableName streamShape
Basic types
– float
– int (AMD extension)
– double (AMD extension)
Stream Shapes
Up to four dimensions
– 1D: <n1>
– 2D: <n2, n1>
– 3D: <n3, n2, n1>
– 4D: <n4, n3, n2, n1>
Kernel
Examples
kernel void sum (…) {…}
Kernel Parameters: Input Streams
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}
Calling A Map Kernel
kernel void
sum (float a<>, float b<>, out float c<>)
{
    c = a + b;
}
a:  0  1  2  3
b: 10 11 12 13
c: 10 12 14 16
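The map semantics shown by these values can be reproduced with a plain loop. The following is a CPU sketch only; `mapSum` is a hypothetical helper name and the use of `std::vector` for streams is an assumption, not how the Brook+ runtime represents them. The kernel body `c = a + b` is applied independently to each element position.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the Brook+ map kernel `sum`: one independent
// invocation of the body per stream element.
std::vector<float> mapSum(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());  // this sketch requires streams of the same shape
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];        // kernel body: c = a + b
    return c;
}
```

Because no iteration depends on another, the elements can be processed in any order, which is what lets the GPU run them as parallel threads.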
Reduction Kernel
Calling a Reduction Kernel
reduce void
sumR (float a<>, reduce float c<>)
{
    c = c + a; // c += a;
}
a: 0 1 2 3 4 5 6 7
c: 6 22
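The values above are consistent with each output element accumulating a contiguous group of inputs: 0+1+2+3 = 6 and 4+5+6+7 = 22. The sketch below emulates that on the CPU. `reduceSum` is a hypothetical name, and the contiguous grouping and the requirement that the input size be a multiple of the output size are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the reduction kernel `sumR`: the input stream is folded
// into an output stream of `outSize` elements using c = c + a.
std::vector<float> reduceSum(const std::vector<float>& a, std::size_t outSize) {
    assert(outSize > 0 && a.size() % outSize == 0);
    const std::size_t factor = a.size() / outSize;  // inputs per output element
    std::vector<float> c(outSize, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i / factor] += a[i];                      // kernel body: c = c + a
    return c;
}
```

Reducing the same eight-element stream to a single element instead of two would give the full sum, 28.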
Non-Stream Parameters
Random Access Stream Parameters
Kernel Computation
The Brook+ Implementation
Overview
Compiler (brcc)
– Convert Brook+ program into C++ code
– Code generation for kernels
• CAL Backend: IL kernels, embedded in the C++ code
Runtime (brt)
– CAL runtime component
• Invoke CAL API to allocate memory, compile and execute IL kernels
– CPU runtime component
• Emulate the GPU kernel execution for debugging
Compiler-Runtime Interface
– A few C++ classes (brt.hpp, kerneldesc.hpp)
• brook::stream, brook::kernel, kernelDescriptor
– Compiler generates code
• Create class objects, set object properties
• Send message to objects
– Runtime implements the classes for each target platform
Compiling a Brook+ Program
Semantic Check
application.br
Kernel Compilers CPU-Code Editor
AST
application.exe
Modified CTool
– C parser
– AST + manipulation
– AST => C
CPU-Code editor
– Insert calls to generic runtime API
AMD Specific Components
IL code generator
Language extensions
– More data types: int, double
– Beyond pure stream model: Scatter
Other enhancement
– Semantic error reporting, …
IL Code Generator
Map kernel parameters to CAL Input/Output
– Output streams, input streams, gather streams, scatter
streams, non-stream parameters
– Generate meta data to describe the mapping information
Multi-pass technique
– Multiple IL kernels to implement the functionality of a
Brook+ kernel
Input Stream Implementation
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}
Translating Kernel Computation
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}
func 0
add r6.x, r2.x, r3.x
mov r7.x, r6.x
mov r5.x, r7.x
ret
Gather/Scatter Stream Implementation
Two methods
√ Using ConstBuffer
• Good for small data set, no indirect access
• IL instruction example: mov dst, cb0[index]
– Using InputBuffer
• IL instruction example: sample_resource(id)_sampler(id) dst, index
Reduction Kernel Implementation
Multi-passes
– Compiler generates (n-1) IL kernels, one for each
reductionFactor (reductionFactor = 2 .. n)
• n is currently hard-coded to 8
– Runtime executes multiple IL kernels to achieve the target
reductionFactor
Target reductionFactor = 11
Input: 0 1 2 3 4 5 6 7 8 9 10
First pass (reductionFactor = 2): 1 5 9 13 17 10
Second pass (reductionFactor = 6): 55
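The two-pass example can be checked with a small CPU sketch. Both function names are hypothetical, and two things are assumptions: that a pass sums contiguous groups, and that a trailing partial group is summed as it stands (which matches the lone element 10 surviving the first pass). The per-pass factors are supplied by the caller, because how the runtime chooses them is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reduction pass: sums each group of `factor` consecutive elements.
// A trailing partial group is summed as-is.
std::vector<float> reducePass(const std::vector<float>& in, std::size_t factor) {
    assert(factor >= 2);
    std::vector<float> out((in.size() + factor - 1) / factor, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i / factor] += in[i];
    return out;
}

// Applies the given per-pass factors in order until one value remains.
float multiPassReduce(std::vector<float> data, const std::vector<std::size_t>& factors) {
    for (std::size_t f : factors)
        data = reducePass(data, f);
    assert(data.size() == 1);  // the factors must reduce the stream completely
    return data[0];
}
```

With the eleven inputs 0 to 10, a pass with factor 2 yields 1 5 9 13 17 10, and a second pass with factor 6 yields 55, so a target factor of 11 is reached even though no single generated kernel exceeds a factor of 8.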
Stream Construction
Prepare And Execute A Kernel (2/2)
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}
__k->PushStream(a);
__k->PushStream(b);
__k->PushOutput(c);
__k->Map();
}
Representing A Kernel
Implementation
KernelDescriptor
    implementations: vector<TechniqueDesc>
TechniqueDesc
    passes: vector<PassDesc>
    reductionFactor: int
    …
PassDesc
    shader: string
    constants: vector<Input>
    samplers: vector<Input>
    outputs: vector<Input>
    …
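As a rough picture of how these descriptors nest, the structure can be written as plain C++ structs. Field names follow the slide. Everything else is an assumption made for illustration: the `Input` stub, the omission of the members elided by "…", and the `findTechnique` helper, which is hypothetical and not taken from the Brook+ runtime headers.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Input { int index; };       // placeholder: the real definition is not shown

struct PassDesc {
    std::string shader;            // IL source for one pass
    std::vector<Input> constants;
    std::vector<Input> samplers;
    std::vector<Input> outputs;
};

struct TechniqueDesc {
    std::vector<PassDesc> passes;  // one technique may need several passes
    int reductionFactor;
};

struct KernelDescriptor {
    std::vector<TechniqueDesc> implementations;
};

// Hypothetical lookup: returns the technique generated for a given
// reductionFactor, or nullptr if the descriptor has none.
const TechniqueDesc* findTechnique(const KernelDescriptor& k, int reductionFactor) {
    for (const TechniqueDesc& t : k.implementations)
        if (t.reductionFactor == reductionFactor) return &t;
    return nullptr;
}
```

A lookup of this kind is one plausible way a runtime could pick among the per-factor kernels that the compiler emits for a reduction.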
Putting It Together (1/3)
#include <stdio.h>
kernel void sum (float a<>, float b<>, out float c<>)
{
c = a + b;
}
int main()
{
float data[4] = {1.0, 2.0, 3.0, 4.0};
int i;
float a<4>;    // becomes: ::brook::stream a(::brook::getStreamType(( float *)0), 4,-1);
float c<4>;    // becomes: ::brook::stream c(::brook::getStreamType(( float *)0), 4,-1);
streamRead(a, data);
sum(a, a, c);
streamWrite(c, data);
Putting It Together (2/3)
kernel void sum (float a<>, float b<>, out float c<>)
{
    c = a + b;
}
Generated kernel descriptor:
static const kernel_desc __sum_cal_desc
    = kernel_desc().pass("
        sample_resource(0)_sampler(0) r0.x, v0.xy
        sample_resource(1)_sampler(1) r1.x, v1.xy
        mov r2.x, r0.x
        mov r3.x, r1.x
        call 0
        mov r4.x, r5.x
        mov o0, r4.x
        ret
    ");
static const void *__sum_cal = &__sum_cal_desc;
Putting It Together (3/3)
#include <stdio.h>
kernel void sum (float a<>, float b<>, out float c<>)
{
    c = a + b;
}
int main()
{
    float data[4] = {1.0, 2.0, 3.0, 4.0};
    int i;
    float a<4>;
    float c<4>;
    streamRead(a, data);
    sum(a, a, c);
    streamWrite(c, data);
    for (i = 0; i < 4; i++)
        printf("%f ", data[i]);
    printf("\n");
    return 0;
}
Generated code for the kernel call:
void sum (brook::stream a,
          brook::stream b,
          brook::stream c)
{
    static const void *__sum_fp[] =
    {
        "cal", __sum_cal,
        "cpu", NULL,
        NULL, NULL };
    static ::brook::kernel
        __k(__sum_fp);
    __k->PushStream(a);
    __k->PushStream(b);
    __k->PushOutput(c);
    __k->Map();
}
Summary
Brook+ Language
– C with stream extension
– Data parallel programming
Brook+ Compiler
– IL code generator
• Direct translation, no optimization
• Implement multi-pass technique
– CPU Code Editor
• Generate code to construct streams and kernels, prepare
and execute kernels
Future Work
Optimization
– High-level optimization: beyond kernels
– Intra-kernel optimization
Matrix Multiplication…
Breaking the 200GFlops Barrier
CAL Performance Counters and AMD GPU Shader Analyzer (GSA) for
performance analysis
Outline
Brook+ Implementations
– Naïve implementation
– Optimized implementation
Optimizing IL Kernel
– Initial IL kernel
– Loop unrolling
– Breaking data dependence chain
– Improving cache locality
– Further improving cache locality
System Information
CPU/System Information:
– 1.8 GHz Dual-Core AMD Opteron™ Model 2210
– 4GB of RAM
– Microsoft® Windows® XP SP 2
– ATI Radeon™ HD 3870 GPU, 512MB
Software:
– CAL SDK v1.00.2 beta*
– Brook+ SDK v1.00 beta*
– AMD GSA*
* Reference[2]
Naïve CPU
void matmultCPU( float* a, float* b, float* c, int m, int k, int n)
float temp = 0;
c[y * n + x] = temp;
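Only fragments of the naive CPU routine survive above. A complete version, reconstructed as a sketch, might look like the following. The loop nest is inferred from the signature and from the indexing used in the tutorial's loop body (`a[y * k + z]`, `b[z * n + x]`, `c[y * n + x]`), so the loop order, the `const` qualifiers, and the precondition check are assumptions rather than the original code.

```cpp
#include <cassert>

// Naive row-major matrix multiply: c (m x n) = a (m x k) * b (k x n).
void matmultCPU(const float* a, const float* b, float* c, int m, int k, int n) {
    assert(a && b && c && m > 0 && k > 0 && n > 0);
    for (int y = 0; y < m; ++y) {          // row of c
        for (int x = 0; x < n; ++x) {      // column of c
            float temp = 0;
            for (int z = 0; z < k; ++z)    // dot product of row y of a and column x of b
                temp += a[y * k + z] * b[z * n + x];
            c[y * n + x] = temp;
        }
    }
}
```

The inner loop walks `b` with a stride of `n` elements, an access pattern that caches handle poorly as the matrices grow.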
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: CPU Naïve]
Transforming to Brook+
float temp = 0;
{
    temp += a[y * k + z] * b[z * n + x];
}
c[y * n + x] = temp;
Annotations: "Convert to while loop"; "Handled implicitly by Brook+"
Naïve Brook+
kernel void
float4 step = float4( 0.0f, 1.0f, 1.0f, 0.0f); // represents the step by which index is incremented
float i0 = Width;
// A[M][K] * B[K][N]
accumulator += A[index.zw]*B[index.xy];
index += step;
i0 = i0 - 1.0f;
C = accumulator;
Naïve Statistics
•5 Registers
•7 ALU Instructions
•2 Gather Instructions
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: CPU Naïve, Brook+ Naïve]
Block Matrix Multiplication
A X B
Matrix Vectorization
[Figure: float elements X, Y, Z, W of matrix A grouped into a float4]
Multiple I/O Streams
A Matrix: A1, A2, A3, A4, A5, A6, A7, A8
B Matrix: B1, B2, B3, B4
C Matrix: C1, C2, C3, C4, C5, C6, C7, C8
optimized_matmult( float loopVar0, float4 A1[][], float4 A2[][], float4 A3[][], float4 A4[][], float4 A5[][], float4 A6[][],
float4 A7[][], float4 A8[][], float4 B1[][], float4 B2[][], float4 B3[][], float4 B4[][], out float4 C1<>,
out float4 C2<>, out float4 C3<>, out float4 C4<>, out float4 C5<>, out float4 C6<>,
float2 vPos = indexof( C1).xy; // vPos - Position of the output matrix i.e. (x,y)
float4 index = float4( vPos.x, vPos.y, four210.w, four210.w); // index - coordinates of A & B from where the values are fetched
float i0 = loopVar0;
// LOOP Here
C1 = accumulator1;
C2 = accumulator2;
C3 = accumulator3;
C4 = accumulator4;
C5 = accumulator5;
C6 = accumulator6;
C7 = accumulator7;
C8 = accumulator8;
Optimized Brook+ Implementation (2/2)
while (i0 > 0.0f)
float4 A11 = A1[index.wy]; float4 A22 = A2[index.wy]; float4 A33 = A3[index.wy]; float4 A44 = A4[index.wy];
float4 A55 = A5[index.wy]; float4 A66 = A6[index.wy]; float4 A77 = A7[index.wy]; float4 A88 = A8[index.wy];
float4 B11 = B1[index.xw]; float4 B22 = B2[index.xw]; float4 B33 = B3[index.xw]; float4 B44 = B4[index.xw];
index += four210.wwwz;
// Reducing iterator
i0 = i0 - 1.0f;
}
Optimized Statistics
•29 Registers
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: CPU Naïve, Brook+ Naïve, Brook+ Opt]
Optimizing IL Kernel
IL Kernel - Setup
Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop
End Loop
Setup Output Streams
Write out results
IL Kernel - Loop
Left column:
whileloop
ge r11._y__, r11.y, r37.x    (Check loop bound)
mad r12, r13.x, r21, r12
mad r12, r13.z, r23, r12
Right column:
mad r12, r16.y, r22, r6
mad r12, r16.z, r23, r12
mad r12, r20.x, r21, r12
mad r12, r20.z, r23, r12
IL Kernel - Summary
Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop
Loop Iteration Check
Fetch B Stream Data
Fetch A Stream Data
Perform sub-matrix multiplication
Version 1 - Statistics
• 26 Registers
• 12 Gather Instructions
• 49 ALU Instructions
Matrix Size    Cache Hit %
16             97.605
32             97.104
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: Brook+ Naïve, Brook+ Opt, PLDIv1]
Problems?
Setup Constants
Setup Input streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Fetch B Stream Data
  Fetch A Stream Data
  Perform sub-matrix multiplication
Notes:
– Break check for every iteration
– Memory/ALU switching frequently
Loop Unrolling
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Fetch B Stream Data 1
  Fetch A Stream Data 1
  Perform sub-matrix 1 multiplication
  Increment Position Counter
  Fetch B Stream Data 2
  Fetch A Stream Data 2
  Perform sub-matrix 2 multiplication
  Increment Position Counter
  Fetch B Stream Data 3
  Fetch A Stream Data 3
  Perform sub-matrix 3 multiplication
  Increment Position Counter
  Fetch B Stream Data 4
  Fetch A Stream Data 4
  Perform sub-matrix 4 multiplication
  Increment Position Counter
  Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
Notes:
– 1/4th branches required
– 1/4th increments required
– CAL Compiler has more information for optimizations and reordering
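The same unrolling pattern can be shown in miniature on a dot product. This is an analogy written in C++, not the tutorial's IL kernel: the function name is hypothetical, and the sketch assumes the element count is a multiple of four. One loop-bound check and one loop-counter update now cover four multiply-add blocks, while the position counter is still incremented after every block.

```cpp
#include <cassert>

// Dot product with the loop body unrolled four times.
float dotUnrolled4(const float* a, const float* b, int n) {
    assert(a && b && n >= 0 && n % 4 == 0);
    float acc = 0.0f;
    int pos = 0;                          // position counter
    for (int i = 0; i < n; i += 4) {      // one check and one increment per four blocks
        acc += a[pos] * b[pos]; ++pos;    // block 1, then increment position counter
        acc += a[pos] * b[pos]; ++pos;    // block 2
        acc += a[pos] * b[pos]; ++pos;    // block 3
        acc += a[pos] * b[pos]; ++pos;    // block 4
    }
    return acc;
}
```

The larger loop body also gives a compiler more instructions to schedule together, which is the "more information for optimizations and reordering" point.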
Version 2 - Statistics
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: Brook+ Opt, PLDIv1, PLDIv2]
Dependent instructions
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Fetch B Stream Data 1
  Fetch A Stream Data 1
  Perform sub-matrix 1 multiplication
  Increment Position Counter
  Fetch B Stream Data 2
  Fetch A Stream Data 2
  Perform sub-matrix 2 multiplication
  Increment Position Counter
  Fetch B Stream Data 3
  Fetch A Stream Data 3
  Perform sub-matrix 3 multiplication
  Increment Position Counter
  Fetch B Stream Data 4
  Fetch A Stream Data 4
  Perform sub-matrix 4 multiplication
  Increment Position Counter
  Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
Notes:
– Fetches depend on increments
– Increments are dependent
Break Dependence Chain
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Increment Position Counter 1
  Increment Position Counter 2
  Increment Position Counter 3
  Increment Position Counter 4
  Fetch B Stream Data 1
  Fetch A Stream Data 1
  Perform sub-matrix 1 multiplication
  Fetch B Stream Data 2
  Fetch A Stream Data 2
  Perform sub-matrix 2 multiplication
  Fetch B Stream Data 3
  Fetch A Stream Data 3
  Perform sub-matrix 3 multiplication
  Fetch B Stream Data 4
  Fetch A Stream Data 4
  Perform sub-matrix 4 multiplication
  Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
Notes:
– Fetches are no longer held up by increments
– Stream fetches and sub-matrix blocks are now rearrangeable
– Does it really work?
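A dot product makes the reordering easy to see. Again this is a C++ analogy with a hypothetical function name, assuming the element count is a multiple of four, and not the IL kernel itself. All four positions are computed from the loop counter at the top of the iteration, so no fetch has to wait for the increment that precedes it, and the loads can be issued back to back.

```cpp
#include <cassert>

// Dot product unrolled by four with the position computations hoisted to the
// top of the loop body, so the four fetch pairs are independent of each other.
float dotIndependent4(const float* a, const float* b, int n) {
    assert(a && b && n >= 0 && n % 4 == 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i += 4) {
        // Positions 1 to 4: each derived from i, none from another position.
        const int p1 = i, p2 = i + 1, p3 = i + 2, p4 = i + 3;
        // Fetches: free to be issued in any order.
        const float a1 = a[p1], b1 = b[p1];
        const float a2 = a[p2], b2 = b[p2];
        const float a3 = a[p3], b3 = b[p3];
        const float a4 = a[p4], b4 = b[p4];
        // Multiply-add blocks.
        acc += a1 * b1;
        acc += a2 * b2;
        acc += a3 * b3;
        acc += a4 * b4;
    }
    return acc;
}
```

The accumulations still form a chain through `acc`; only the address computations and fetches have been decoupled, which matches what the slide claims.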
Version 3 – Statistics
• 28 Registers
• 48 Gather Instructions
• 136 ALU Instructions
• 87.35% ALU utilization
Matrix Size    Cache Hit %
16             95.278
32             95.555
64             96.763
128            98.571
256            98.799
512            98.782
1024           98.421
2048           97.981
4096           97.662
* Yellow – No change, Green – Increase, Red - Decline
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: PLDIv1, PLDIv2, PLDIv3]
[Figure: texture cache behaviour. Thread 1 requests (1,1) through (1,5); labels: Miss, Hit!, Request, Cache, Fetch, Threads, Texture, Return, 4 Dwords, 2 Quads]
Alternating Data Streams
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Increment Position Counter 1
  Increment Position Counter 2
  Increment Position Counter 3
  Increment Position Counter 4
  Fetch B Stream Data 1
  Fetch A Stream Data 1
  Perform sub-matrix 1 multiplication
  Fetch B Stream Data 2
  Fetch A Stream Data 2
  Perform sub-matrix 2 multiplication
  Fetch B Stream Data 3
  Fetch A Stream Data 3
  Perform sub-matrix 3 multiplication
  Fetch B Stream Data 4
  Fetch A Stream Data 4
  Perform sub-matrix 4 multiplication
  Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
Notes:
– 4 data fetches
– 8 data fetches
– Bad for cache! Data is thrown away before it is used.
Version 4 - Statistics
• 32 Registers
• 48 Gather Instructions
Matrix Size    Cache Hit %
16             95.277
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: Brook+ Opt, PLDIv1, PLDIv2, PLDIv3, PLDIv4]
Still Alternating Data Streams
Setup Constants & Input Streams
Setup Loop and Data variables
Start Loop
  Loop Iteration Check
  Increment Position Counter 1
  Increment Position Counter 2
  Increment Position Counter 3
  Increment Position Counter 4
  Fetch B Stream Data 1
  Fetch B Stream Data 2
  Fetch A Stream Data 1
  Perform sub-matrix 1 multiplication
  Fetch A Stream Data 2
  Perform sub-matrix 2 multiplication
  Fetch B Stream Data 3
  Fetch B Stream Data 4
  Fetch A Stream Data 3
  Perform sub-matrix 3 multiplication
  Fetch A Stream Data 4
  Perform sub-matrix 4 multiplication
  Increment Loop Counter * 4
End Loop
Setup Output Streams
Write out results
Notes:
– 8 data fetches
– 8 data fetches
– Switching between B and A data streams!
Final Version - Statistics
• 38 Registers
• 48 Gather Instructions
• 136 ALU Instructions
• 87.35% ALU utilization
• CAL SDK: simple_matmult.exe
Matrix Size    Cache Hit %
16             95.302
32             95.577
64             96.778
128            98.570
256            98.746
512            98.843
1024           98.928
2048           98.999
4096           99.00
* Yellow – No change, Green – Increase, Red - Decline
Performance
[Chart: Matrix Multiplication, GFlops vs. Matrix Size (16 to 4096); series: PLDIv1, PLDIv2, PLDIv3, PLDIv4, PLDIv5]
Summary
Grouping data fetches from the same texture allows better cache locality
Use more registers to hide data fetch latency, ~192 registers per
SIMD on ATI HD3870
Tutorial Summary
Brook+
– An approach to implementing the Brook stream programming
language using CAL
– Application performance heavily relies on compiler optimization,
which is not available in the current implementation
References
Trademark Attribution
AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon, and combinations thereof are trademarks of Advanced
Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United
States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective
owners.