Advanced VLSI Architecture
MEL G624
Lecture 30: Data Level Parallelism
Programming the GPU
To distinguish between functions for the GPU (device) and the
system processor (host):
__device__ or __global__ => GPU (device)
__host__ => system processor (host)
Variables declared in __device__ or __global__ functions are
allocated to GPU memory
Function call: name<<<dimGrid, dimBlock>>>(..parameter list..)
threadIdx: identifier for a thread within its block
blockIdx: identifier for a block within the grid
blockDim: number of threads per block
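A minimal sketch (illustrative names, not from the lecture) of how the qualifiers and the <<<dimGrid, dimBlock>>> launch syntax fit together:
//Kernel: runs on the GPU device, one thread per element
__global__ void scale(float* data, float factor, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;   //global thread index
  if (i < n) data[i] = data[i]*factor;
}
//Host function: chooses grid/block dimensions and launches the kernel
__host__ void launch_scale(float* d_data, float factor, int n) {
  dim3 dimBlock(256);              //threads per block
  dim3 dimGrid((n + 255)/256);     //blocks in the grid
  scale<<<dimGrid, dimBlock>>>(d_data, factor, n);
}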
Programming the GPU
//Invoke DAXPY
daxpy(n,2.0,x,y);
//DAXPY in C
void daxpy(int n, double a, double* x, double* y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}
//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
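The slides do not show how x and y reach GPU memory; a hedged host-side sketch using the standard CUDA runtime calls (device pointer names d_x, d_y are illustrative):
//Allocate GPU memory, copy inputs, launch DAXPY, copy the result back
double *d_x, *d_y;
cudaMalloc((void**)&d_x, n*sizeof(double));
cudaMalloc((void**)&d_y, n*sizeof(double));
cudaMemcpy(d_x, x, n*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(double), cudaMemcpyHostToDevice);
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
cudaDeviceSynchronize();                        //wait for the kernel to finish
cudaMemcpy(y, d_y, n*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);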
Nvidia GPU Instruction Set Architecture
The instruction set targeted by the NVIDIA compiler is an abstraction
of the hardware instruction set
PTX (Parallel Thread Execution)
Stable instruction set for compilers
Compatibility across generations of GPUs
PTX describes operations on a single CUDA thread
Usually a one-to-one mapping with a hardware instruction, although
one PTX instruction can expand to many machine instructions
PTX uses virtual registers (the compiler allocates physical registers
to each thread and maps the virtual registers onto them)
Nvidia GPU Instruction Set Architecture
//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
shl.s32 R8, blockIdx, 8    ; Thread Block ID * Block size (256 or 2^8)
add.s32 R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
shl.s32 R8, R8, 3          ; byte offset (8 bytes per double)
ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0  ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks
Branch synchronization stack
Entries consist of masks for each SIMD lane,
i.e., which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into
multiple execution paths
Push on a divergent branch
...and when paths converge
Acts as a barrier, pops the stack
Per-thread-lane 1-bit predicate register, specified by the programmer
Conditional Branching
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else X[i] = Z[i];
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
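A small sketch (not from the lecture; assumes the usual warp size of 32) contrasting a branch that diverges within a warp, where both paths execute under masks, with one whose condition is uniform across each warp:
//Divergent: adjacent lanes of the same warp take different paths
__global__ void divergent(float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i % 2 == 0) x[i] = x[i]*2.0f;    //half the lanes masked off here
  else            x[i] = x[i] + 1.0f;  //the other half masked off here
}
//Uniform per warp: all 32 lanes of a warp take the same path, no masking needed
__global__ void uniform(float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if ((i/32) % 2 == 0) x[i] = x[i]*2.0f;
  else                 x[i] = x[i] + 1.0f;
}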
NVIDIA GPU Memory Structures
Each SIMD Lane has private section of off-chip DRAM
“Private memory”, not shared by any other lanes
Contains stack frame, spilling registers, and private variables
Recent GPUs cache this in L1 and L2 caches
Each MT SIMD processor also has “Local memory” that is on
chip
Shared by SIMD lanes / threads within a block only
The off-chip memory shared by SIMD processors is “GPU
Memory”
Host can read and write GPU memory
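A minimal sketch (kernel name and the 256-threads-per-block assumption are illustrative) showing how the three levels appear in CUDA source: per-thread private values in registers, per-block on-chip local memory declared with __shared__, and off-chip GPU memory accessed through global pointers:
//Each block stages 256 elements in on-chip shared memory and reduces them
__global__ void block_sum(const float* in, float* out, int n) {
  __shared__ float tile[256];               //local memory, shared within the block
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : 0.0f;         //private, per-thread value
  tile[threadIdx.x] = v;
  __syncthreads();                          //barrier for the block
  for (int stride = blockDim.x/2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
      tile[threadIdx.x] += tile[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  //one result per block to GPU memory
}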
NVIDIA GPU Memory Structures
(figure: GPU memory hierarchy)
Detecting and Enhancing Loop Level Parallelism
Focuses on determining whether data accesses in later
iterations depend on data values produced in
earlier iterations
Loop-carried dependence
A loop has loop-level parallelism only if it has no loop-carried dependence
for (i=999; i>=0; i=i‐1)
x[i] = x[i] + s;
No Loop‐carried dependence
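Because no iteration reads a value written by another iteration, each iteration can become its own CUDA thread; a hedged sketch (kernel name illustrative):
__global__ void add_scalar(double* x, double s, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] + s;   //iteration i touches only x[i]
}
//e.g. add_scalar<<<(n + 255)/256, 256>>>(x, s, n);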
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
S1 uses a value computed by S1 in an earlier iteration:
iteration i computes A[i+1] (and B[i+1]), which is used in iteration i+1
Loop-carried dependence
S2 uses the value A[i+1] computed by S1 in the same iteration
Not a loop-carried dependence
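Unrolling the first two iterations (for illustration) makes both kinds of dependence visible:
/* iteration i = 0 */
A[1] = A[0] + C[0];   /* S1 */
B[1] = B[0] + A[1];   /* S2: uses A[1] from S1 of the same iteration */
/* iteration i = 1 */
A[2] = A[1] + C[1];   /* S1: needs A[1] from iteration 0 -> loop-carried */
B[2] = B[1] + A[2];   /* S2: needs B[1] from iteration 0 -> loop-carried */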
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
S1 uses a value computed by S2 in the previous iteration, but
this dependence is not circular:
neither statement depends on itself
Although S1 depends on S2, S2 does not depend on S1, so
interchanging the two statements will not affect the
execution of S2
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i]; /*S2*/
A[i+1] = A[i+1] + B[i+1]; /*S1*/
}
B[100] = C[99] + D[99];
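The rewrite is legal because S2 of iteration i produces the B[i+1] that S1 consumes in iteration i+1, so pairing them inside one iteration removes the loop-carried dependence. A small self-contained C check (not part of the lecture; array sizes and initial values are illustrative) that both versions compute the same A and B:
#include <stdio.h>
#include <string.h>
#define N 100
void original(double* A, double* B, const double* C, const double* D) {
  for (int i = 0; i < N; i++) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
  }
}
void transformed(double* A, double* B, const double* C, const double* D) {
  A[0] = A[0] + B[0];
  for (int i = 0; i < N-1; i++) {
    B[i+1] = C[i] + D[i];          /* S2 */
    A[i+1] = A[i+1] + B[i+1];      /* S1 */
  }
  B[N] = C[N-1] + D[N-1];
}
int main(void) {
  double A1[N], A2[N], B1[N+1], B2[N+1], C[N], D[N];
  for (int i = 0; i < N; i++) { C[i] = i; D[i] = 2*i; A1[i] = A2[i] = 0.5*i; }
  for (int i = 0; i <= N; i++) { B1[i] = B2[i] = 0.25*i; }
  original(A1, B1, C, D);
  transformed(A2, B2, C, D);
  printf("%s\n", (memcmp(A1, A2, sizeof A1) == 0 &&
                  memcmp(B1, B2, sizeof B1) == 0) ? "results match" : "results differ");
  return 0;
}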
Detecting and Enhancing Loop Level Parallelism
for (i=0;i<100;i=i+1) {
A[i] = B[i] + C[i];
D[i] = A[i] * E[i];
}
The second reference to A need not be translated into a load
instruction:
the two references are to the same address, and there is no intervening
memory access to the same location
A more complex analysis, i.e., loop-carried dependence analysis
combined with data dependence analysis within the same basic
block, enables this optimization
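One way the loop can look after this optimization (a sketch; the temporary t is illustrative), with the reused value kept in a register instead of reloading A[i]:
for (int i = 0; i < 100; i = i+1) {
  double t = B[i] + C[i];   /* computed once, held in a register */
  A[i] = t;                 /* store A[i] */
  D[i] = t * E[i];          /* reuse t; no second load of A[i] */
}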
Detecting and Enhancing Loop Level Parallelism
A recurrence is a special form of loop-carried dependence.
for (i=1; i<100; i=i+1) {
  Y[i] = Y[i-1] + Y[i];
}
Detecting a recurrence is important:
some vector computers have special support for
executing recurrences,
and it may still be possible to exploit a fair amount of ILP
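For example (a sketch in the style of the lecture's loops; the dependence distance of 5 is illustrative), when the recurrence's dependence distance is larger than 1, iterations closer together than that distance are independent of each other, which is where the remaining parallelism/ILP comes from:
for (int i = 5; i < 100; i = i+1)
  Y[i] = Y[i-5] + Y[i];   /* iteration i depends only on iteration i-5 */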
Thank You for Attending