B1 Data Parallel

This document summarizes a presentation on parallel processing on GPUs. It discusses types of parallel data flow including map, gather, scatter operations. It also covers parallel prefix or scan operations and how they can be used to implement parallel reduction, as well as segmented scan. Finally, it discusses concepts of precision and accuracy in parallel computing and techniques for quantization to preserve accuracy while reducing precision.


Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Robert Strzodka
ARCS 2008: Architecture of Computing Systems, GPGPU and CUDA Tutorials
Dresden, Germany, February 25, 2008

Overview

Parallel Processing on GPUs
Types of Parallel Data Flow
Parallel Prefix or Scan
Precision and Accuracy


The GPU is a Fast, Parallel Array Processor

Input arrays: 1D, 2D (typical), 3D
Output arrays: 1D, 2D (typical), 3D (slice)

Vertex Processor (VP): kernel changes index regions of output arrays
Rasterizer: creates data streams from index regions
Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
The stream of array elements arrives in unknown order


The GPU is a Fast, Highly Multi-Threaded Processor

Input arrays: nD
Output arrays: nD

Start thousands of parallel threads in groups of m, e.g. 32.
Each group operates in a SIMD fashion, with predication if necessary.
In general all threads are independent, but certain collections of groups may use local memory to exchange data.

Native Memory Layout: Data Locality

CPU: 1D input, 1D output; other dimensions with offsets.
GPU: 2D input, 2D output; other dimensions with offsets.

(Figure: input and output arrays with color-coded locality, red (near), blue (far).)

Primitive Index Regions in Output Arrays

Quads and triangles: fastest option.
Line segments: slower; try to pair lines into 2xh or wx2 quads.
Point clouds: slowest; try to gather points into larger forms.

GPUs are Optimized for Local Data Access

Memory access types: cached, sequential, random.

CPU (Pentium 4): large cache, few processing elements, optimized for spatial and temporal data reuse.
GPU (GeForce 7800 GTX): small cache, many processing elements, optimized for sequential (streaming) data access.

(Chart comparing the access types on both processors, courtesy of Ian Buck.)

Input and Output Arrays

CPU: input and output arrays may overlap.
GPU: input and output arrays must not overlap.

Configuration Overhead

(Chart: small workloads are configuration limited, large workloads are computation limited; courtesy of Ian Buck.)


Parallel Data-Flow: Map, Gather and Scatter

Map: x = f(a)
Gather: x = f( a(1,2), a(3,5), ... )
Scatter: x(2,3) = f(a), x(6,7) = g(a), ...
General: x(2,3) = f( a(1,2), a(3,5), ... ), x(6,7) = f( a(6,2), a(7,5), ... ), ...
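These patterns can be sketched sequentially in plain Python; the loops stand in for independent GPU threads, and the function names are illustrative rather than any GPU API:

```python
# Sequential sketches of the basic data-flow patterns on a flat
# 1D array (the slide's index pairs stand for 2D accesses).

def parallel_map(f, a):
    # Map: every output depends only on the input at the same index
    return [f(v) for v in a]

def gather(f, a, index_lists):
    # Gather: each output reads from several arbitrary input positions
    return [f([a[j] for j in idx]) for idx in index_lists]

def scatter(a, targets, n_out, default=0):
    # Scatter: each input element writes to an arbitrary output position
    x = [default] * n_out
    for v, t in zip(a, targets):
        x[t] = v
    return x

a = [3, 1, 7, 0]
print(parallel_map(lambda v: v * 2, a))   # [6, 2, 14, 0]
print(gather(sum, a, [[0, 1], [2, 3]]))   # [4, 7]
print(scatter(a, [2, 0, 3, 1], 4))        # [1, 0, 3, 7]
```

Note that scatter, unlike gather, can produce write conflicts when two inputs target the same output position; here the last write wins.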

Performance of Gather and Scatter

(Chart courtesy of Naga Govindaraju.)

Parallel Reduction, e.g. Maximum of an Array

Gather a 2x2 region of the input for each output element and write the maximum of that region into an N/2 x N/2 output array. Repeating this on the intermediates halves the array in each pass until a single result remains.

For commutative operators (e.g. +, *, max) this reduction is encapsulated into a single function call.
For a more detailed discussion see Mark Harris' CUDA optimization talk from SC 2007: http://www.gpgpu.org/sc2007/
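A minimal sequential sketch of the reduction tree, assuming a 1D array and pairwise gathers in place of the slides' 2x2 regions:

```python
# Tree reduction: each pass combines pairs of elements, halving the
# array until one value remains. On a GPU every pair of one pass
# would be processed by an independent thread.

def tree_reduce(op, a):
    a = list(a)                       # assumes a non-empty input
    while len(a) > 1:
        pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(len(a) // 2)]
        if len(a) % 2:
            pairs.append(a[-1])       # odd leftover passes through
        a = pairs
    return a[0]

print(tree_reduce(max, [3, 1, 7, 0, 4, 1, 6, 3]))              # 7
print(tree_reduce(lambda x, y: x + y, [3, 1, 7, 0, 4, 1, 6, 3]))  # 25
```

Commutativity (and associativity) of op is what allows the pairs of a pass to be combined in any order.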

Parallel Prefix or Scan

(slides in this section courtesy of Shubho Sengupta)

Motivation

Common scenarios in parallel computing:
Variable output per thread: stream compaction
Threads want to perform a split: radix sort, building trees

What came before/after me? Where do I start writing my data?
Scan answers these questions.
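Stream compaction can be sketched sequentially in a few lines: an exclusive scan over 0/1 flags tells every surviving element where to start writing. The names are illustrative, not from any GPU library:

```python
# Stream compaction via scan: keep only the flagged elements,
# with each survivor's output address given by an exclusive scan.

def exclusive_scan(flags):
    out, total = [], 0
    for f in flags:
        out.append(total)   # sum of everything strictly to the left
        total += f
    return out

def compact(a, keep):
    flags = [1 if k else 0 for k in keep]
    addr = exclusive_scan(flags)           # output slot per survivor
    out = [None] * (addr[-1] + flags[-1])  # total number of survivors
    for i, f in enumerate(flags):
        if f:                              # independent scattered writes
            out[addr[i]] = a[i]
    return out

a = [3, 1, 7, 0, 4, 1, 6, 3]
print(compact(a, [v > 2 for v in a]))  # [3, 7, 4, 6, 3]
```

On a GPU the scan runs as a parallel primitive and the final writes are an independent scatter, one thread per element.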

Scan

Each element is the sum of all the elements to the left of it (exclusive scan).
Each element is the sum of all the elements to the left of it and itself (inclusive scan).

Input:     3  1  7  0  4  1  6  3
Exclusive: 0  3  4 11 11 15 16 22
Inclusive: 3  4 11 11 15 16 22 25
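The two definitions as straightforward sequential loops; the input [3, 1, 7, 0, 4, 1, 6, 3] is a reconstruction of the slide's example, chosen to match the partial sums that survive in the text:

```python
# Sequential reference implementations that pin down the two
# scan definitions; the parallel versions must produce the same output.

def exclusive_scan(a):
    out, total = [], 0
    for v in a:
        out.append(total)   # everything strictly to the left
        total += v
    return out

def inclusive_scan(a):
    out, total = [], 0
    for v in a:
        total += v          # includes the element itself
        out.append(total)
    return out

a = [3, 1, 7, 0, 4, 1, 6, 3]
print(exclusive_scan(a))  # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(a))  # [3, 4, 11, 11, 15, 16, 22, 25]
```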

Scan the past

First proposed in APL (1962).
Used as a data parallel primitive in the Connection Machine (1990).
Guy Blelloch used scan as a primitive for various parallel algorithms (1990).

Scan the present

First GPU implementation by Daniel Horn (2004), O(n log n).
Subsequent GPU implementations by Hensley (2005) O(n log n), Sengupta (2006) O(n), Greß (2006) O(n) 2D.
NVIDIA CUDA implementation by Mark Harris (2007), O(n), space efficient.

Scan - Reduce

Pairwise partial sums climb a binary tree until the last element holds the total (25 for the example input).
log n steps; work halves each step; O(n) work.
In place, space efficient.

Scan - Down Sweep

Starting with zero at the root, each node passes its own value down to the left child and the sum of its value and the former left subtree total to the right child; afterwards every position holds the exclusive prefix sum (0 3 4 11 11 15 16 22 for the example input).
log n steps; work doubles each step; O(n) work.
In place, space efficient.
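The up-sweep and down-sweep combine into the work-efficient exclusive scan; a sequential in-place sketch, assuming the array length is a power of two:

```python
# Work-efficient exclusive scan (up-sweep / down-sweep).
# Length must be a power of two; the inner loops over i would run
# as parallel threads on a GPU.

def blelloch_scan(a):
    a = list(a)
    n = len(a)
    d = 1
    while d < n:                    # up-sweep: partial sums climb the tree
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0                    # clear the root (the total)
    d = n // 2
    while d >= 1:                   # down-sweep: push prefix sums back down
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a

print(blelloch_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Both sweeps touch O(n) elements in total, which is what makes this variant work efficient compared to the O(n log n) schemes.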

Segmented Scan

Scan within each segment in parallel; the running sum restarts at every segment boundary.

Introduced by Schwartz (1980). Forms the basis for a wide variety of algorithms:
Radix sort, quicksort
Sparse matrix-vector multiply
Convex hull
Solving recurrences
Tree operations
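A sequential sketch of an exclusive segmented scan, with head flags marking segment starts (the example data is illustrative):

```python
# Exclusive segmented scan: the running sum resets wherever the
# head flag is set, so each segment is scanned independently.

def segmented_exclusive_scan(a, flags):
    out, total = [], 0
    for v, f in zip(a, flags):
        if f:
            total = 0       # a new segment restarts the scan
        out.append(total)
        total += v
    return out

a     = [3, 1, 7, 0, 4, 1, 6, 3]
flags = [1, 0, 0, 1, 0, 1, 0, 0]   # three segments
print(segmented_exclusive_scan(a, flags))
# [0, 3, 4, 0, 0, 0, 1, 7]
```

The parallel versions fold the flags into the scan operator, so that all segments are processed by one scan over the whole array.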

Segmented Scan: Large Input (figure)

Segmented Scan: Advantages

Operates in parallel over all the segments.
Handles irregular workloads, since segments can be of any length.
Can simulate divide-and-conquer recursion, since additional segments can be generated.

Segmented Scan Variant: Tridiagonal Solver

A tridiagonal system of n rows is solved in parallel; then each of the m columns is processed in parallel. The read pattern is similar to, but more complex than, scan.


The Erratic Roundoff Error

Roundoff error for 0 = f(a) := |(1+a)^3 - (1+3a^2) - (3a+a^3)|, which is exactly zero in real arithmetic.

(Plot: y = log2( f(a) ), with 0 mapped to 2^-100, against x = log2( 1/a ), i.e. a = 1/2^x for x = 10..50, in single and double precision; smaller is better. The error jumps erratically with a.)
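The experiment can be reproduced by rounding every intermediate result to single precision; the f32 helper and the exact evaluation order are assumptions, since the slide does not spell them out:

```python
import struct

def f32(x):
    # round a Python float (binary64) to the nearest binary32 value
    return struct.unpack('f', struct.pack('f', x))[0]

def f(a, rnd=lambda x: x):
    # evaluate |(1+a)^3 - (1+3a^2) - (3a+a^3)| with every intermediate
    # rounded by rnd; the identity rnd gives plain double precision
    a = rnd(a)
    s = rnd(1.0 + a)
    t1 = rnd(rnd(s * s) * s)                      # (1+a)^3
    t2 = rnd(1.0 + rnd(3.0 * rnd(a * a)))         # 1 + 3a^2
    t3 = rnd(rnd(3.0 * a) + rnd(rnd(a * a) * a))  # 3a + a^3
    return abs(rnd(rnd(t1 - t2) - t3))

a = 2.0 ** -12           # a = 1/2^x with x = 12
print(f(a))              # double precision roundoff
print(f(a, f32))         # single precision roundoff
```

Sweeping x over 10..50 and plotting log2 of both values reproduces the erratic picture from the slide.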

Precision and Accuracy

There is no monotonic relation between the computational precision and the accuracy of the final result. Increasing precision can decrease accuracy!

The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.

The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy. We obtain a mixed precision method.

Quantization - Preserving Accuracy

Watch out for cancellation: for a ≈ b, prefer r = c(a-b) over r = c*a - c*b.

Maximize operations on the same scale: for a ∈ [0,1] and b,c ∈ [10^-4, 10^-3], prefer r = a+(b+c) over r = a+b+c.

Make implicit relations between constants explicit: for a_i = 0.01, i = 0..99, set a_99 = 1 - sum_{i<99} a_i, so that sum_{i<100} a_i = 1 exactly instead of only approximately.

Use symmetric intervals for multiplication: for a ∈ [-1,1], prefer r = 0.1134*a + 0.1134 over r = 0.1134*(a+1).

Minimize the number of multiplications: prefer r = 0.1*(a+b) + 0.15*(a+c) over r = 0.25*a + 0.1*b + 0.15*c.
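A small check of the "same scale" rule in single precision, using the stdlib struct module to round doubles to floats; the particular values are chosen so that the rounding effect is visible and are not from the slide:

```python
import struct

def f32(x):
    # round a Python float (binary64) to the nearest binary32 value
    return struct.unpack('f', struct.pack('f', x))[0]

a = 1.0                 # a on scale [0,1]
b = c = 5 * 2.0 ** -27  # b, c on a much smaller scale

left  = f32(f32(a + b) + c)   # r = (a+b)+c: each small term vanishes alone
right = f32(a + f32(b + c))   # r = a+(b+c): small terms combine first
exact = 1.0 + 10 * 2.0 ** -27 # double precision reference

print(left, right)
```

Adding b and c first lets their sum cross the rounding threshold of a's scale, so the grouped form ends up closer to the exact result.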

Resources for Signed Integer Operations

Operation                   Area        Latency
min(r,0), max(r,0)          b+1
add(r1,r2), sub(r1,r2)      2b
add(r1,r2,r3), add(r4,r5)   2b
mult(r1,r2), sqr(r)         b(b-2)      b ld(b)
sqrt(r)                     2c(c-5)     c(c+3)

b: bitlength of argument, c: bitlength of result

FPGA Results: Conjugate Gradient (CG)

(Chart: area in number of slices of s??e11 float kernels on the xc2v500/xc2v8000, plotted against bits of mantissa (20-50), for an adder, a multiplier, and the CG kernel normalized by 1/30; smaller is better.)

[Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]

High Precision Emulation

Given an m x m bit unsigned integer multiplier, we want to build an n x n multiplier with an n = k*m bit result:

a * b = ( sum_{i=1..k} 2^{im} a_i ) * ( sum_{j=1..k} 2^{jm} b_j )
      = sum_{i+j <= k+1} 2^{(i+j)m} a_i b_j  +  sum_{i+j > k+1} 2^{(i+j)m} a_i b_j

The evaluation of the first sum requires k(k+1)/2 multiplications; the evaluation of the second depends on the rounding mode.

For floating point numbers additional operations are necessary because of the mantissa/exponent distinction. A double emulation with two aligned s23e8 single floats is less complex than an exact s52e11 double emulation, achieves s46e8 precision, and still requires 10-20 single float operations.
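The integer decomposition can be sanity-checked in a few lines. This sketch uses 0-based limbs, sums all k^2 partial products instead of splitting at i+j = k+1, and m = 16 is an arbitrary choice:

```python
# Building a wide multiply out of m x m bit partial products:
# a 32x32-bit product from 16x16-bit multiplications (k = 2 limbs).

M = 16
MASK = (1 << M) - 1

def wide_mul(a, b, k=2):
    # split a and b into k limbs of m bits each (little-endian order)
    al = [(a >> (i * M)) & MASK for i in range(k)]
    bl = [(b >> (j * M)) & MASK for j in range(k)]
    r = 0
    for i in range(k):
        for j in range(k):
            # each partial product needs only the m x m multiplier
            r += (al[i] * bl[j]) << ((i + j) * M)
    return r

a, b = 0xDEADBEEF, 0xCAFEBABE
print(hex(wide_mul(a, b)))
```

When only the low n bits are kept, the high-order partial products can be dropped or approximated, which is where the slide's k(k+1)/2 count and the rounding-mode dependence come from.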

Precision Performance: Rough Estimates

Reconfigurable device, e.g. FPGA: 2x float add ≈ 1x double add; 4x float mul ≈ 1x double mul.
Hardware emulation (compute area limited), e.g. GPU: 2x float add ≈ 1x double add; 5x float mul ≈ 1x double mul.
Hardware emulation (data path limited), e.g. CPU: 2x float add ≈ 1x double add; 2x float mul ≈ 1x double mul.
Software emulation: 10x float add ≈ 1x double add; 20x float mul ≈ 1x double mul.

Mixed Precision Iterative Refinement for Ax = b

Exploit the speed of low precision and obtain a result of high accuracy:

d_k = b - A*x_k         compute in high precision (cheap)
solve A*c_k = d_k       solve in low precision (fast)
x_{k+1} = x_k + c_k     correct in high precision (cheap)
k = k+1                 iterate until convergence in high precision

The low precision solver acts as a pre-conditioner in a high precision iterative method.
If A is small and dense: solve A*c_k = d_k directly.
If A is large and sparse: solve A*c_k = d_k (approximately) with an iterative method itself.
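A toy sketch of the loop, with a 2x2 system and single-precision rounding (via the stdlib struct module) standing in for the fast low precision solver; the matrix and the helper names are illustrative:

```python
import struct

def f32(x):
    # round to single precision: our stand-in for the low precision
    # solver's arithmetic
    return struct.unpack('f', struct.pack('f', x))[0]

# a small, well-conditioned system A x = b
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def solve_low(d):
    # direct 2x2 solve (Cramer's rule), every intermediate rounded
    # to single precision
    det = f32(f32(A[0][0] * A[1][1]) - f32(A[0][1] * A[1][0]))
    c0 = f32(f32(f32(d[0] * A[1][1]) - f32(d[1] * A[0][1])) / det)
    c1 = f32(f32(f32(d[1] * A[0][0]) - f32(d[0] * A[1][0])) / det)
    return [c0, c1]

x = [0.0, 0.0]
for k in range(5):
    # residual d_k = b - A x_k in high (double) precision
    d = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    c = solve_low(d)                     # A c_k = d_k in low precision
    x = [x[i] + c[i] for i in range(2)]  # correction in high precision

print(x)  # converges toward A^-1 b = [1/11, 7/11]
```

Each pass shrinks the error by roughly the single precision unit roundoff, so a handful of cheap iterations recovers full double precision accuracy.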

CPU Results: LU Solver

(Chart courtesy of Jack Dongarra; larger is better.)

[Langou et al., Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006]

GPU Results: Conjugate Gradient (CG) and Multigrid (MG)

(Chart: performance of double precision CPU and mixed precision CPU-GPU solvers; seconds per grid node, smaller is better, against data level, for CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU.)

[Göddeke et al., Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, IJPEDS 2007]

Conclusions

Parallel processing on GPUs is about identifying independent work and preserving data locality.
Map, gather and scatter are the basic types of parallel data-flow.
Parallel prefix (scan) enables the parallelization of many seemingly inherently sequential algorithms.
Precision ≠ accuracy! Mixed precision methods can reduce resource requirements quadratically.
