B1 Data Parallel
Overview
ARCS 2008
Output Arrays:
1D, 3D (slice),
2D (typical)
Rasterizer
Creates data streams
from index regions
Output Arrays:
nD
Start thousands of
parallel threads in
groups of m, e.g. 32
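As a rough illustration of the launch model above, the following Python sketch partitions a linear range of thread ids into groups of m and maps each id to an index of a 2D output array. The function names, the row-major mapping and the short last group are assumptions made for illustration; the slides do not prescribe them.

```python
def thread_groups(num_threads, m=32):
    """Split thread ids 0..num_threads-1 into consecutive groups of m.

    Assumption: the last group is simply shorter when num_threads is not
    a multiple of m.
    """
    return [list(range(start, min(start + m, num_threads)))
            for start in range(0, num_threads, m)]


def output_index_2d(thread_id, width):
    """Map a linear thread id to (row, column) of a row-major 2D output array."""
    return divmod(thread_id, width)


groups = thread_groups(100, m=32)
print(len(groups), [len(g) for g in groups])  # 4 [32, 32, 32, 4]
print(output_index_2d(70, width=16))          # (4, 6)
```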
GPU
1D input
1D output
Other dimensions with offsets
2D input
2D output
Other dimensions with offsets
Output region
Fastest option
Output region
Line segments
Slower; try to pair lines into 2xh or wx2 quads
Output region
Point clouds
Slowest; try to gather points into larger forms
CPU
Large cache
Few processing elements
Optimized for spatial and
temporal data reuse
GPU
Small cache
Many processing elements
Optimized for sequential
(streaming) data access
Chart: Pentium 4 (courtesy of Ian Buck)
Configuration Overhead
Chart: configuration limited vs. computation limited (courtesy of Ian Buck)
Overview
Map: x = f(a)
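A minimal sketch of the map pattern, assuming one output element per input element: each x[i] depends only on a[i], so on a GPU all elements could be computed in parallel. The sequential Python loop below only stands in for that parallel execution, and the name `gpu_style_map` is hypothetical.

```python
def gpu_style_map(f, a):
    """Apply f independently to every element: x[i] = f(a[i])."""
    return [f(value) for value in a]


x = gpu_style_map(lambda v: 2.0 * v + 1.0, [0.0, 1.0, 2.0, 3.0])
print(x)  # [1.0, 3.0, 5.0, 7.0]
```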
Chart courtesy of Naga Govindaraju
input
N/2 x N/2 output array
gather 2x2 regions for each output
first output: maximum of 2x2 region
intermediates
result
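The reduction walked through above (gather a 2x2 region per output, take its maximum, repeat on the intermediates until one result remains) can be sketched as follows. This is a sequential stand-in for the parallel passes; the function names are hypothetical, and the sketch assumes a square input whose side length is a power of two.

```python
def max_reduce_2x2(grid):
    """One pass: each output element is the maximum of a 2x2 input region."""
    half = len(grid) // 2
    return [[max(grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
                 grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1])
             for c in range(half)]
            for r in range(half)]


def max_reduce(grid):
    """Repeat the 2x2 pass on the intermediates until a 1x1 result remains."""
    while len(grid) > 1:
        grid = max_reduce_2x2(grid)
    return grid[0][0]


data = [[1, 5, 2, 0],
        [3, 4, 9, 1],
        [7, 2, 6, 6],
        [0, 8, 3, 5]]
print(max_reduce_2x2(data))  # [[5, 9], [8, 6]]
print(max_reduce(data))      # 9
```

Each pass shrinks an N x N array to N/2 x N/2, so the number of passes is logarithmic in N.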
Overview
Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy
slides courtesy of
Shubho Sengupta
Motivation
Stream Compaction
Split
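One common way to build stream compaction on top of a prefix sum is sketched below: an exclusive scan over keep/discard flags gives every surviving element its output position, and a scatter writes it there. The slides do not show this formulation, so treat it as an assumption; the helper names and the use of `None` for discarded elements are hypothetical.

```python
def exclusive_scan(values):
    """out[i] is the sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(items, keep):
    """Remove elements for which keep(item) is false, preserving order."""
    flags = [1 if keep(item) else 0 for item in items]
    positions = exclusive_scan(flags)
    count = positions[-1] + flags[-1] if items else 0
    out = [None] * count
    for item, flag, pos in zip(items, flags, positions):
        if flag:
            out[pos] = item  # scatter to the scanned position
    return out


stream = [3, None, 7, None, None, 1]
print(compact(stream, lambda v: v is not None))  # [3, 7, 1]
```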
Scan
Input
Exclusive (trailing values): 11 11 15 16 22
Inclusive (trailing values): 11 11 15 16 22 25
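The difference between the exclusive and the inclusive scan can be sketched sequentially as below. The input array is an assumption: it was chosen so that the trailing results (11, 11, 15, 16, 22 and 25) agree with the values that survive on the slide.

```python
def inclusive_scan(values):
    """out[i] is the sum of values[0..i]."""
    out, running = [], 0
    for v in values:
        running += v
        out.append(running)
    return out


def exclusive_scan(values):
    """out[i] is the sum of values[0..i-1]; out[0] is the identity 0."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]  # assumed input, not taken from the slide
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15, 16, 22, 25]
```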
Scan - Reduce
log n steps
Work halves each step
O(n) work
In place, space efficient

log n steps
Work doubles each step
O(n) work
In place, space efficient
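The two phases above, a reduce in which the work halves each step followed by a pass in which the work doubles each step, match what is commonly described as a work-efficient scan. The sketch below is a sequential rendering of that scheme under stated assumptions: the input length is a non-empty power of two, the operator is addition, and the inner loops stand in for operations that would run in parallel.

```python
def work_efficient_exclusive_scan(values):
    """In-place exclusive scan: reduce phase, then a second sweep back down."""
    data = list(values)
    n = len(data)  # assumed to be a power of two, n >= 1

    # Reduce phase: log n steps, the number of additions halves each step.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Second phase: log n steps, the number of operations doubles each step.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data


print(work_efficient_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Both phases together perform a number of additions proportional to n, and no second array is needed, which is where the "O(n) work" and "in place" notes come from.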
Segmented Scan
Introduced by Schwartz (1980)
Forms the basis for a wide variety of algorithms
Radixsort, Quicksort
Sparse Matrix-Vector Multiply
Convex Hull
Solving recurrences
Tree operations
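A minimal sequential sketch of a segmented scan, assuming segments are marked by head flags (1 at the first element of each segment): the running sum restarts at every flagged position. The flag representation, the input values and the function name are assumptions for illustration.

```python
def segmented_inclusive_scan(values, head_flags):
    """Inclusive sum scan that restarts wherever head_flags[i] is set."""
    out, running = [], 0
    for v, is_head in zip(values, head_flags):
        running = v if is_head else running + v
        out.append(running)
    return out


values = [3, 1, 7, 0, 4, 1, 6, 3]
flags = [1, 0, 1, 0, 0, 1, 0, 0]  # three segments: [3,1] [7,0,4] [1,6,3]
print(segmented_inclusive_scan(values, flags))
# [3, 4, 7, 7, 11, 1, 7, 10]
```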
Overview
Chart: single precision vs. double precision (smaller is better)
r = c*a - c*b
r = c(a - b)
r = a + b + c
r = a + (b + c)
r = Σ_{i<100} a_i … 1
r = Σ_{i<100} a_i = 1
r = 0.1134*(a + 1)
r = 0.1134a + 0.1134
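The expression pairs above are equal in exact arithmetic but need not be equal in floating point. The sketch below demonstrates this with Python's double-precision floats; the concrete operand values are assumptions chosen for illustration, and `math.fsum` is used only as a correctly rounded reference.

```python
import math


def left_to_right(a, b, c):
    """Evaluate a + b + c as (a + b) + c."""
    return (a + b) + c


def right_grouped(a, b, c):
    """Evaluate the same sum as a + (b + c)."""
    return a + (b + c)


def naive_sum(terms):
    """Accumulate the terms one by one, rounding after every addition."""
    total = 0.0
    for t in terms:
        total += t
    return total


# Regrouping changes the rounded result.
print(left_to_right(0.1, 0.2, 0.3), right_grouped(0.1, 0.2, 0.3))

# Factoring out a constant may change the rounded result as well.
a, b, c = 1.1, 0.3, 0.7
print(c * a - c * b, c * (a - b))

# One hundred terms that sum to 1 in exact arithmetic.
terms = [0.01] * 100
print(naive_sum(terms), math.fsum(terms))
```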
Operation | Area | Latency
min(r,0), max(r,0) | b+1 |
add(r1,r2), sub(r1,r2) | 2b |
add(r1,r2,r3) add(r4,r5) | 2b |
mult(r1,r2), sqr(r) | b(b-2) | b ld(b)
sqrt(r) | 2c(c-5) | c(c+3)
Chart: number of slices vs. bits of mantissa for adder, multiplier, and CG kernel normalized (1/30); smaller is better
[Göddeke et al. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
\[
\sum_{i=1}^{m} 2^{-i} a_i \cdot \sum_{j=1}^{m} 2^{-j} b_j
= \sum_{\substack{i,j=1 \\ i+j \le k+1}}^{m} 2^{-(i+j)} a_i b_j
+ \sum_{\substack{i,j=1 \\ i+j > k+1}}^{m} 2^{-(i+j)} a_i b_j
\]
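The splitting of a mantissa product into two groups of partial products can be checked with exact rational arithmetic. The sketch below assumes that bit i of an m-bit mantissa carries weight 2^-i and that the two sums partition the partial products at i + j = k + 1; the function names and the sample bit patterns are hypothetical.

```python
from fractions import Fraction


def mantissa_value(bits):
    """Value of a mantissa whose bit i (1-based) has weight 2**-i."""
    return sum(Fraction(bit, 2 ** i) for i, bit in enumerate(bits, start=1))


def split_product(a_bits, b_bits, k):
    """Sum the partial products a_i * b_j * 2**-(i+j) in two groups.

    Returns (leading, trailing): terms with i + j <= k + 1 and the rest.
    """
    m = len(a_bits)
    leading, trailing = Fraction(0), Fraction(0)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            term = Fraction(a_bits[i - 1] * b_bits[j - 1], 2 ** (i + j))
            if i + j <= k + 1:
                leading += term
            else:
                trailing += term
    return leading, trailing


a_bits = [1, 0, 1, 1]  # 11/16
b_bits = [1, 1, 0, 1]  # 13/16
leading, trailing = split_product(a_bits, b_bits, k=4)
print(leading, trailing, leading + trailing)
print(mantissa_value(a_bits) * mantissa_value(b_bits))
```

The two groups always add up to the exact product; dropping the trailing group is what a multiplier that keeps only the leading partial products would do.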
Software emulation
10x float add per 1x double add
20x float mul per 1x double mul
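To give a feeling for why one emulated high-precision addition costs many single-precision operations, here is a sketch of a "float-float" addition, in which a value is stored as an unevaluated sum of two float32 numbers. The slides do not say which emulation scheme is meant, so this choice (based on the classic error-free two-sum) is an assumption. float32 arithmetic is imitated by rounding every intermediate through `struct`, and all function names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Return the float32 sum of a and b and its exact rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


def split_double(x):
    """Represent x as a (high, low) pair of float32 values."""
    hi = f32(x)
    return hi, f32(x - hi)


def ff_add(x, y):
    """Add two (high, low) float32 pairs using float32 operations only."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


hi, lo = ff_add(split_double(1.0), split_double(1e-9))
print(f32(1.0 + 1e-9))  # plain float32 addition loses the small term
print(hi, lo)           # the pair keeps it in the low component
```

Counting the rounded operations in `ff_add` shows roughly ten float32 additions and subtractions for one emulated addition.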
Chart (larger is better) courtesy of Jack Dongarra
[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy
(revisiting iterative refinement for linear systems), SC 2006]
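The idea named in the cited title, iterative refinement with a low-precision solver and high-precision residuals, can be sketched on a tiny system. Everything below is an illustrative assumption rather than the method of the cited paper: the 2x2 matrix, Cramer's rule as the inner solver, and float32 imitated by rounding through `struct`.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve2x2_low(A, r):
    """Solve A c = r by Cramer's rule, rounding every step to float32."""
    a, b, c, d = (f32(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    r0, r1 = f32(r[0]), f32(r[1])
    det = f32(f32(a * d) - f32(b * c))
    num0 = f32(f32(r0 * d) - f32(b * r1))
    num1 = f32(f32(a * r1) - f32(r0 * c))
    return [f32(num0 / det), f32(num1 / det)]


def residual(A, x, b):
    """Compute b - A x in double precision."""
    return [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]


def refine(A, b, iterations=5):
    """Low-precision corrections accumulated in a double-precision solution."""
    x = [0.0, 0.0]
    for _ in range(iterations):
        r = residual(A, x, b)         # high precision
        c = solve2x2_low(A, r)        # low precision
        x = [x[0] + c[0], x[1] + c[1]]  # high precision
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]  # exact solution: x = (1/11, 7/11)
print(solve2x2_low(A, b))
print(refine(A, b))
```

The bulk of the arithmetic stays in the cheap low-precision solver, while the residual and the update in double precision recover the accuracy.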
Chart: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU by data level; smaller is better
[Göddeke et al. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]
Conclusions