Optimizing Reduction Kernels
Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Indian Institute of Technology Kharagpur
Course Organization
Topic                                           Week    Hours
Review of basic COA w.r.t. performance          1       2
Intro to GPU architectures                      2       3
Intro to CUDA programming                       3       2
Multi-dimensional data and synchronization      4       2
Warp Scheduling and Divergence                  5       2
Memory Access Coalescing                        6       2
Optimizing Reduction Kernels                    7       3
Kernel Fusion, Thread and Block Coarsening      8       3
OpenCL - runtime system                         9       3
OpenCL - heterogeneous computing                10      2
Efficient Neural Network Training/Inferencing   11-12   6
Recap
Parallel Patterns
Reduction Algorithm
Serial and Parallel Implementation
A sequential version:
I O(n)
I for(int i = 0; i < n; ++i) ...
A parallel version:
I O(log2 n)
I "tree"-based implementation
Figure: the sequential version accumulates the array into the total sum one element at a time; the parallel version sums pairs in a tree of + operations down to the total sum.
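The tree pattern above can be sketched on the CPU as sequential code that mimics the parallel levels (`tree_reduce` is a hypothetical helper, not from the slides; a power-of-two length is assumed):

```cpp
#include <vector>

// Tree-based reduction sketch: log2(n) passes of pairwise sums.
// Each pass halves the active range, mirroring one parallel level.
int tree_reduce(std::vector<int> a) {
    for (std::size_t stride = a.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i)
            a[i] += a[i + stride];   // in the GPU version, these run in parallel
    }
    return a[0];
}
```

On a GPU, each inner loop iteration maps to one thread, so the pass count, not the total work, bounds the latency.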
Parallel Reduction Algorithm
Kernel Decomposition
I Decompose computation into multiple kernel invocations
I Kernel launch serves as a global synchronization point
I Negligible HW overhead, low SW overhead
Figure: Multiple Kernel Invocations (from 'Optimizing Parallel Reduction in CUDA' by Mark Harris)
Optimization In Reduction
Reduction 1: Interleaved Addressing
Figure: interleaved addressing reduction tree (divergent branch)
Reduction 1: Kernel
__global__ void reduce1(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // modulo arithmetic is slow!
        if ((tid % (2 * s)) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 1: Host
Reduction 1: Host Code for Multiple Kernel Launch
... // cudaMemcpyHostToDevice ...
int threadsPerBlock = 64;
int old_blocks, blocks = (N / threadsPerBlock) / 2;
blocks = (blocks == 0) ? 1 : blocks;
old_blocks = blocks;
while (blocks > 0) // call compute kernel
{
    sum<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(devPtrA);
    old_blocks = blocks;
    blocks = (blocks / threadsPerBlock) / 2;
};
if (blocks == 0 && old_blocks != 1) // final kernel call, if still needed
    sum<<<1, old_blocks/2, (old_blocks/2) * sizeof(int)>>>(devPtrA);
... // cudaMemcpyDeviceToHost ...
Reduction 1: Analysis
Reduction 2: Interleaved Addressing
Figure: interleaved addressing with a non-divergent branch (contiguous threads do the work)
Reduction 2: Kernel
__global__ void reduce2(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 2: Analysis
Interleaved addressing with divergent branching (Reduction 1). Problems:
I highly divergent warps are very inefficient
I % operator is very slow
I half of the threads do nothing!
I loop is expensive
I shared memory bank conflicts
Array Size: 2^26, Threads/Block: 1024, GPU used: Tesla K40m
Reduction    Time (Second)    Bandwidth (GB/Second)
Reduce 1     0.03276          8.1951
Reduce 2     0.02312          11.6117
Shared Memory Bank Conflict
Reduction 3: Sequential Addressing
Reduction 3: Kernel
__global__ void reduce3(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 3: Analysis
Sequential addressing, no divergence or bank conflicts
Reduction 4: First Add During Load
Reduction 4: Kernel
__global__ void reduce4(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 4: Analysis
Memory bandwidth is still underutilized
Reduction 5: Unrolling the Last Warp
Reduction 5: Kernel
__global__ void reduce5(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce(sdata, tid);
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 5: warpReduce
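The body of this slide did not survive extraction. As a sketch following Mark Harris's 'Optimizing Parallel Reduction in CUDA' (which the deck already cites), the warpReduce called by reduce5 is typically written as below; `volatile` keeps the compiler from caching shared-memory values in registers, and correctness relies on pre-Volta warp-synchronous execution:

```cuda
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    // The last 32 active threads form a single warp executing in
    // lockstep (pre-Volta), so no __syncthreads() is needed here.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```

On Volta and later, independent thread scheduling breaks this implicit synchronisation; `__syncwarp()` between steps (or `__shfl_down_sync`) is the safe modern form.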
Reduction 5: Analysis
Problems:
I Still have iterations
I loop overhead
Array Size: 2^26, Threads/Block: 1024, GPU used: Tesla K40m
Reduction    Time (Second)    Bandwidth (GB/Second)
Reduce 1     0.03276          8.1951
Reduce 2     0.02312          11.6117
Reduce 3     0.01939          13.839
Reduce 4     0.01104          24.3098
Reduce 5     0.00836          32.1053
Reduction 6: Complete Unrolling
If the number of iterations is known at compile time, we can completely unroll the reduction.
I Block size is limited to 512 or 1024 threads
I Block size should be a power of 2
I For a fixed block size, complete unrolling is easy
I For a generic implementation, the solution is:
I CUDA supports C++ template parameters on device and host functions
I Block size can be specified as a function template parameter
Reduction 6: Kernel
Specify the block size as a function template parameter; the blockSize comparisons are then evaluated at compile time.
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    ...
}
Reduction 6: Kernel
if (blockSize >= 256) {
    if (tid < 128)
        sdata[tid] += sdata[tid + 128];
    __syncthreads();
}
if (blockSize >= 128) {
    if (tid < 64)
        sdata[tid] += sdata[tid + 64];
    __syncthreads();
}
if (tid < 32)
    warpReduce<blockSize>(sdata, tid);
}
Reduction 6: Kernel
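This slide's body is also missing. A sketch of the templated warpReduce that reduce6 calls (again after Harris's deck): because blockSize is a compile-time constant, the untaken `if` branches are eliminated entirely by the compiler:

```cuda
template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    // Each comparison against blockSize is resolved at compile time,
    // so the emitted code contains only the needed additions.
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >=  8) sdata[tid] += sdata[tid + 4];
    if (blockSize >=  4) sdata[tid] += sdata[tid + 2];
    if (blockSize >=  2) sdata[tid] += sdata[tid + 1];
}
```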
Reduction 6: Invoking Template Kernels
Use a switch statement over the possible block sizes while invoking the template kernel:
switch (threadsPerBlock) {
    case 512: reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 256: reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 128: reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 64:  reduce6<64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 32:  reduce6<32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 16:  reduce6<16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 8:   reduce6<8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 4:   reduce6<4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 2:   reduce6<2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 1:   reduce6<1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
}
Reduction 6: Analysis
Array Size: 2^26, Threads/Block: 1024, GPU used: Tesla K40m
Reduction    Time (Second)    Bandwidth (GB/Second)
Reduce 6     0.00769          34.9014
Reduction 7: Multiple Adds / Thread
Algorithm Cascading:
I Combine sequential and parallel reduction
I Each thread loads and sums multiple elements into shared memory
I Tree-based reduction in shared memory
I Replace the load-and-add of two elements with a loop that adds as many as necessary
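The cascading idea can be sketched on the CPU (`cascaded_reduce` is a hypothetical helper, not from the slides; a power-of-two thread count is assumed):

```cpp
#include <vector>

// Algorithm cascading sketch: each "thread" first accumulates a
// grid-strided chunk sequentially, then a tree reduction combines
// the per-thread partial sums.
int cascaded_reduce(const std::vector<int> &in, std::size_t threads) {
    std::vector<int> partial(threads, 0);
    for (std::size_t t = 0; t < threads; ++t)         // sequential part
        for (std::size_t i = t; i < in.size(); i += threads)
            partial[t] += in[i];
    for (std::size_t s = threads / 2; s > 0; s /= 2)  // tree part
        for (std::size_t t = 0; t < s; ++t)
            partial[t] += partial[t + s];
    return partial[0];
}
```

Keeping the number of thread blocks small and constant amortises the tree's O(log n) synchronisation cost over O(n/p) cheap sequential adds per thread.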
Reduction 7: Kernel
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sdata[tid] = 0;
    // grid-stride loop: each thread adds as many elements as necessary
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();
    // do reduction in shared mem
    ...
}
Reduction 7: Analysis
Array Size: 2^26, Threads/Block: 1024, GPU used: Tesla K40m
Reduction    Time (Second)    Bandwidth (GB/Second)
Reduce 7     0.00277          96.8672
Reduction Performance Comparison wrt time in Tesla K40m GPU
Figure: execution time (ms, y-axis 0-60) versus reduction level (0-6), one curve per data size n = 2^19 to 2^27.
Reduction Performance Comparison wrt bandwidth in Tesla K40m GPU
Figure: bandwidth (GB/s, y-axis 0-100) versus reduction level (0-6), one curve per data size n = 2^19 to 2^27.
Types of Optimization
I Algorithmic optimizations
I Changes to addressing, algorithm cascading (Reductions 1-4, 7)
I Approx. 12x speedup
I Code optimizations
I Loop unrolling (Reductions 5, 6)
I Approx. 3x speedup
Summary
Kernel     Optimization
Reduce1    Interleaved addressing (using modulo arithmetic) with divergent branching
Reduce2    Interleaved addressing (using contiguous threads) with bank conflicts
Reduce3    Sequential addressing, no divergence or bank conflicts
Reduce4    Uses n/2 threads, performs first level during global load
Reduce5    Unrolled loop for last warp, intra-warp synchronisation barriers removed
Reduce6    Completely unrolled, using template parameter to assert whether the number of threads is a power of two
Reduce7    Multiple elements per thread, small constant number of thread blocks launched. Requires very few synchronisation barriers
Example Applications of Reduction
I Bitonic Sort
I Prefix sum
Problem: Sorting
Sorting Networks
Comparator
Increasing comparator: a -> min(a,b), b -> max(a,b)
Decreasing comparator: a -> max(a,b), b -> min(a,b)
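As a minimal sketch, the increasing comparator can be modeled as a pure function (`comparator` is a hypothetical name for illustration):

```cpp
#include <utility>

// Increasing comparator: the smaller input leaves on the first wire,
// the larger on the second (swap the pair for a decreasing comparator).
std::pair<int, int> comparator(int a, int b) {
    return (a <= b) ? std::make_pair(a, b) : std::make_pair(b, a);
}
```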
A Simple Sorting Network
Figure: a simple sorting network built from comparators.
Bubble Sort
Bitonic Sort
Bitonic Sort
Figure: sorting eight elements; columns show the array after successive comparator stages (Step I builds a bitonic sequence, Step II sorts it).
7  7  7  7  7  3  2
9  9  8  8  4  2  3
8  11 11 9  3  7  4
11 8  9  11 2  4  7
12 3  4  12 12 9  8
3  12 12 4  8  8  9
4  4  3  3  9  12 11
2  2  2  2  11 11 12
Recursive Structure
Recursive C Program
// Comparator
void compare(int i, int j, bool dir) {
    if (dir == (a[i] > a[j]))
        exchange(i, j);
}
// Step II
void bitonicMerge(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare(i, i + m, dir);
        bitonicMerge(lo, m, dir);
        bitonicMerge(lo + m, m, dir);
    }
}
Recursive C Program
// Step I
void bitonicSort(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        bitonicSort(lo, m, ASCENDING);
        bitonicSort(lo + m, m, DESCENDING);
        bitonicMerge(lo, n, dir);
    }
}
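A self-contained CPU adaptation of the recursive program above can be run directly; the array `a`, the `exchange` step, and the direction constants are declared locally here (they are assumed globals in the slides):

```cpp
#include <algorithm>
#include <vector>

static const bool ASCENDING = true, DESCENDING = false;
static std::vector<int> a;   // the array being sorted (global, as in the slides)

static void compare(int i, int j, bool dir) {
    if (dir == (a[i] > a[j]))
        std::swap(a[i], a[j]);   // "exchange(i, j)" from the slides
}
// Step II: sort a bitonic sequence of length n starting at lo.
static void bitonicMerge(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare(i, i + m, dir);
        bitonicMerge(lo, m, dir);
        bitonicMerge(lo + m, m, dir);
    }
}
// Step I: build a bitonic sequence, then merge it into sorted order.
static void bitonicSort(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        bitonicSort(lo, m, ASCENDING);
        bitonicSort(lo + m, m, DESCENDING);
        bitonicMerge(lo, n, dir);
    }
}
```

For example, sorting the slide's input {7, 9, 8, 11, 12, 3, 4, 2} with bitonicSort(0, 8, ASCENDING) yields {2, 3, 4, 7, 8, 9, 11, 12}.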
Scope for Parallelization
Figure: the same eight-element trace as in the Bitonic Sort slide; the comparators within each stage touch disjoint pairs, so they can be executed in parallel.
Formulate Parallel Solution
Mapping Sorting Subproblem
Figure: the eight-element sorting network mapped onto one thread block: 4 threads for eight elements.
Shared Memory Size = 2 * Number of Threads
Comparator
__device__ inline void Comparator(uint &keyA, uint &valA, uint &keyB, uint &valB, uint dir) {
    uint t;
    if ((keyA > keyB) == dir) {
        t = keyA; keyA = keyB; keyB = t;
        t = valA; valA = valB; valB = t;
    }
}
Sort in Shared Memory
s_key[threadIdx.x + 0] = d_SrcKey[0];
s_val[threadIdx.x + 0] = d_SrcVal[0];
s_key[threadIdx.x + SHARED_SIZE/2] = d_SrcKey[SHARED_SIZE/2];
s_val[threadIdx.x + SHARED_SIZE/2] = d_SrcVal[SHARED_SIZE/2];
// strided load by threads, each thread loads two elements
for (size = 2; size < SHARED_SIZE; size <<= 1) {
    // Bitonic merge
    uint ddd = (threadIdx.x & (size / 2)) != 0;
    for (stride = size / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
        Comparator(s_key[pos + 0], s_val[pos + 0],
                   s_key[pos + stride], s_val[pos + stride], ddd);
    }
}
Important Variables
Bitonic Sequence Creation
Figure: merge stages for size = 2 and size = 4.
// sort in opposite directions for odd / even block ids
uint ddd = (blockIdx.x + 1) & 1;
for (stride = SHARED_SIZE/2; stride > 0; stride >>= 1) {
    __syncthreads();
    pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
    Comparator(s_key[pos + 0], s_val[pos + 0],
               s_key[pos + stride], s_val[pos + stride], ddd);
}
__syncthreads();
d_DstKey[0] = s_key[threadIdx.x + 0];
d_DstVal[0] = s_val[threadIdx.x + 0];
d_DstKey[SHARED_SIZE/2] = s_key[threadIdx.x + SHARED_SIZE/2];
d_DstVal[SHARED_SIZE/2] = s_val[threadIdx.x + SHARED_SIZE/2];
}
Sorting Bitonic Sequence
For block 0: ddd = (0 + 1) & 1 = 1
Figure: columns show the array after stride = 4, stride = 2, and stride = 1.
7  7  3  2
8  4  2  3
9  3  7  4
11 2  4  7
12 12 9  8
4  8  8  9
3  9  12 11
2  11 11 12
Example Applications of Reduction
I Bitonic Sort
I Prefix Sum
All-Prefix-Sums
Inclusive and Exclusive scan
Example - Inclusive Scan
Array indexes: 0 1 2 3 4 5 6 7
Input array:   3 1 7 0 4 1 6 3
Successive rows show the running sum filling in left to right:
3 1 7 0 4 1 6 3
3 4 7 0 4 1 6 3
3 4 11 0 4 1 6 3
3 4 11 11 4 1 6 3
3 4 11 11 15 1 6 3
3 4 11 11 15 16 6 3
3 4 11 11 15 16 22 3
Output array:  3 4 11 11 15 16 22 25
Sequential Code
Inclusive Scan
void scan(float *output, float *input, int length)
{
    output[0] = input[0]; // since this is an inclusive scan
    for (int j = 1; j < length; ++j)
        output[j] = output[j - 1] + input[j];
}
Sequential Code - Complexity
Parallel Prefix Sum - Hillis/Steele Scan (Algorithm 1)
for d = 1 to log2 n do
    forall k >= 2^(d-1) in parallel do
        x[k] := x[k - 2^(d-1)] + x[k]
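The loop above can be sketched on the CPU with explicit double buffering (`hillis_steele_scan` is a hypothetical helper; on the GPU the inner loop runs in parallel, which is why the output buffer must be separate from the input):

```cpp
#include <utility>
#include <vector>

// Double-buffered CPU sketch of the Hillis/Steele inclusive scan.
std::vector<int> hillis_steele_scan(std::vector<int> x) {
    std::vector<int> y(x.size());
    for (std::size_t offset = 1; offset < x.size(); offset *= 2) {
        for (std::size_t k = 0; k < x.size(); ++k)
            y[k] = (k >= offset) ? x[k - offset] + x[k] : x[k];
        std::swap(x, y);   // output buffer becomes the next input
    }
    return x;
}
```

On the slide's example input 1..8 this reproduces the three steps shown next, ending at 1 3 6 10 15 21 28 36.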
Example - Hillis/Steele Inclusive Scan
Input:                              1 2 3 4 5 6 7 8
Step I (add elements 2^0 away):     1 3 5 7 9 11 13 15
Step II (add elements 2^1 away):    1 3 6 10 14 18 22 26
Step III (add elements 2^2 away):   1 3 6 10 15 21 28 36
Number of steps: O(log n); Work: O(n log n)
Analysis
A double-buffered version of the sum scan from Algorithm 1
Algorithm 2
for d := 1 to log2 n do
    forall k in parallel do
        if k >= 2^(d-1) then
            x[out][k] := x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)
CUDA C code - Algorithm 2
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // For exclusive scan, shift right by one and set first elt to 0
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout * n + thid] += temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid]; // write output
}
Parallel Prefix Sum - Blelloch Scan
I The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum.
I A binary tree with n leaves has log2 n levels, and each level d in [0, log2 n) has 2^d nodes.
I If we perform one add per node, then we will perform O(n) adds on a single traversal of the tree.
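Both sweeps can be sketched on the CPU in a few lines (`blelloch_scan` is a hypothetical helper; n must be a power of two, and the inner loops are the parts that run in parallel on the GPU):

```cpp
#include <vector>

// CPU sketch of the Blelloch work-efficient exclusive scan.
std::vector<int> blelloch_scan(std::vector<int> x) {
    std::size_t n = x.size();
    // Up-sweep (reduce): build partial sums up the tree, in place.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t k = 0; k < n; k += 2 * d)
            x[k + 2 * d - 1] += x[k + d - 1];
    // Down-sweep: set the root to the identity (0), then at each level
    // swap each left child with its parent and add it into the right child.
    x[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t k = 0; k < n; k += 2 * d) {
            int t = x[k + d - 1];
            x[k + d - 1] = x[k + 2 * d - 1];
            x[k + 2 * d - 1] += t;
        }
    return x;
}
```

Unlike Hillis/Steele, this performs O(n) total adds across both sweeps, at the cost of roughly twice as many steps.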
Blelloch Scan - Reduce Phase
Blelloch Scan - Down-Sweep Phase
x[n - 1] := 0
for d := log2 n - 1 down to 0 do
    for k from 0 to n - 1 by 2^(d+1) in parallel do
        t := x[k + 2^d - 1]
        x[k + 2^d - 1] := x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] := t + x[k + 2^(d+1) - 1]
Example - Blelloch Exclusive Scan
Input:             1 2 3 4 5 6 7 8
Reduce step 0:     3 7 11 15
Reduce step 1:     10 26
Reduce step 2:     36
Identity elements: 10 0
Down sweep 0:      3 0 11 10
Down sweep 1:      1 0 3 3 5 10 7 21
Down sweep 2:      0 1 3 6 10 15 21 28
Cuda code
__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;
    temp[2 * thid] = g_idata[2 * thid]; // load input into shared memory
    temp[2 * thid + 1] = g_idata[2 * thid + 1];
    for (int d = n >> 1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
Cuda code
    if (thid == 0)
        temp[n - 1] = 0; // clear the last element
    for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2 * thid] = temp[2 * thid]; // write results to device memory
    g_odata[2 * thid + 1] = temp[2 * thid + 1];
}
Reduction Conclusion
References