
Optimizing Reduction Kernels

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

December 23, 2019

Course Organization
Topic                                            Week    Hours
Review of basic COA w.r.t. performance           1       2
Intro to GPU architectures                       2       3
Intro to CUDA programming                        3       2
Multi-dimensional data and synchronization       4       2
Warp Scheduling and Divergence                   5       2
Memory Access Coalescing                         6       2
Optimizing Reduction Kernels                     7       3
Kernel Fusion, Thread and Block Coarsening       8       3
OpenCL - runtime system                          9       3
OpenCL - heterogeneous computing                 10      2
Efficient Neural Network Training/Inferencing    11-12   6
Recap

- The Host-Kernel Model for CPU-GPU Systems
- The CUDA programming language
- Mapping multi-dimensional kernels to multi-dimensional data

Recap

- Querying device properties
- The concept of scheduling warps
- Performance bottlenecks:
  - branch divergence
  - global memory accesses

Parallel Patterns

- Matrix Multiplication (gather operation)
- Convolution (stencil operation)
- Reduction

Reduction Algorithm

- Reduce a vector to a single value via an associative operator
- Examples: sum, min, max, AND, OR, etc. (an average is a sum reduction followed by one division)
- Visits every element in the array
- Large arrays motivate parallel execution of the reduction
- Not compute bound but memory bound

Serial and Parallel Implementation
A sequential version:
- O(n) work
- for (int i = 0; i < n; ++i) ...

A parallel version:
- O(log2 n) steps
- "tree"-based implementation

(Figure: one thread accumulating the array serially vs. two threads combining partial sums in a tree to produce the total sum.)
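For concreteness, the sequential O(n) version is just an accumulator loop (a sketch; the array a and its length n are assumed):

int sum = 0;
for (int i = 0; i < n; ++i)
    sum += a[i];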
Parallel Reduction Algorithm

To process very large arrays:
- multiple thread blocks are required
- each block reduces a portion of the array
- partial results must be communicated between blocks
- this requires global synchronization

Problem:
- CUDA does not support global synchronization across thread blocks within a kernel launch

Solution:
- kernel decomposition
Kernel Decomposition

- Decompose the computation into multiple kernel invocations
- A kernel launch serves as a global synchronization point
- Negligible hardware overhead, low software overhead

Figure: Multiple Kernel Invocations (from 'Optimizing Parallel Reduction in CUDA' by Mark Harris)

Optimization In Reduction

- Metrics for GPU performance:
  - GFLOP/s for compute-bound kernels (one billion floating-point operations per second)
  - Bandwidth for memory-bound kernels (the rate at which data can be read from or written to memory by the processor)
- Reduction has very low arithmetic intensity: roughly 1 flop per element loaded
- So strive for peak memory bandwidth

Reduction 1: Interleaved Addressing

- Each thread loads one element from global memory into shared memory
- In each step a thread adds two elements
- Half of the remaining threads are deactivated at the end of each step

Figure: Reduction with Interleaved Addressing and a Divergent Branch

Reduction 1: Kernel
__global__ void reduce1(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // modulo arithmetic is slow!
        if ((tid % (2 * s)) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 1: Host

- The kernel computes one partial sum per thread block
- Each block writes its partial sum to g_odata[blockIdx.x], so the partial results occupy the first gridDim.x elements of the output array in global memory
- A final addition over this reduced data set is still needed
- It is done by launching the same kernel again on the partial results

Reduction 1: Host Code for Multiple Kernel Launch

... // cudaMemcpy HostToDevice ...
int threadsPerBlock = 64;
int old_blocks, blocks = (N / threadsPerBlock) / 2;
blocks = (blocks == 0) ? 1 : blocks;
old_blocks = blocks;
while (blocks > 0)   // call compute kernel repeatedly
{
    sum<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(devPtrA);
    old_blocks = blocks;
    blocks = (blocks / threadsPerBlock) / 2;
}
if (blocks == 0 && old_blocks != 1)   // final kernel call, if still needed
    sum<<<1, old_blocks / 2, (old_blocks / 2) * sizeof(int)>>>(devPtrA);
... // cudaMemcpy DeviceToHost ...
Reduction 1: Analysis

Interleaved addressing with divergent branching

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems:
- highly divergent branching makes the warps very inefficient
- half of the threads do nothing after the first step
- the % operator is very slow
- the loop itself is expensive

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
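As a quick check of the bandwidth column (the arithmetic is added here, it is not on the slide): Reduce 1 reads 2^26 x 4 B = 268,435,456 bytes in 0.03276 s, i.e. roughly 0.268 GB / 0.03276 s ≈ 8.2 GB/s, matching the reported figure.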
Reduction 2: Interleaved Addressing

- Replace the divergent branch in the inner loop
- with a strided index and a non-divergent branch
- New problem: shared memory bank conflicts

Figure: Interleaved Addressing Replacing the Divergent Branch

Reduction 2: Kernel
__global__ void reduce2(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 2: Analysis
Interleaved addressing (using contiguous threads) with bank conflicts

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- shared memory bank conflicts (new)
- half of the threads are idle after the first step
- the loop is still expensive
(The divergent branch and the % operator from Reduce 1 are eliminated.)

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Shared Memory Bank Conflict

- Shared memory is divided into banks, and each bank can serve only one read/write access at a time
- If more than one thread attempts to access the same bank at the same time, the accesses are serialized (a bank conflict)
- The hardware splits such a memory request into conflict-free parts, decreasing the effective bandwidth
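A minimal kernel sketch of the difference, assuming the usual configuration of 32 banks of 4-byte words (bank = word index mod 32); the indices are illustrative only and not taken from the slides:

__global__ void bank_conflict_demo(int *out) {
    __shared__ int s[64];
    s[threadIdx.x]      = threadIdx.x;   // conflict-free store: consecutive threads, consecutive banks
    s[threadIdx.x + 32] = threadIdx.x;
    __syncthreads();
    // strided access: threads t and t+16 both map to bank (2*t) % 32, a 2-way conflict
    int strided    = s[2 * threadIdx.x];
    // sequential access: thread t maps to bank t % 32, conflict-free
    int sequential = s[threadIdx.x];
    out[threadIdx.x] = strided + sequential;
}

Launched with a single warp of 32 threads, the strided read is serialized into two transactions, while the sequential read completes in one.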

Reduction 3: Sequential Addressing

- Replace the strided indexing in the inner loop
- with a reversed loop and threadID-based indexing
- New problem: idle threads on the first loop iteration

Figure: Reduction with Sequential Addressing


Reduction 3: Kernel
__global__ void reduce3(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 3: Analysis
Sequential addressing: no divergence and no bank conflicts

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- half of the threads are idle from the first loop iteration
- the loop is still expensive

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduction 4: First Add During Load

- Keep all threads busy in the first step
- Halve the number of blocks
- Replace the single load with two loads
- The load into shared memory itself performs the first level of the reduction

Figure: Reduction with First Add During Load


Reduction 4: Kernel
__global__ void reduce4(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 4: Analysis
Memory bandwidth is still underutilized

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- loop overhead
- another likely bottleneck is instruction overhead

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduction 5: Unrolling the Last Warp

- The number of active threads decreases with each iteration
- When s <= 32, only one warp is left
- A warp executes the same instruction in lockstep (SIMD)
- That means when s <= 32:
  - "__syncthreads()" is not needed
  - "if (tid < s)" is not needed
- So unroll the last 6 iterations

Without unrolling, all warps execute every iteration of the for loop and the if statement.
Reduction 5: Kernel
__global__ void reduce5(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce(sdata, tid);
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 5: warpReduce

__device__ void warpReduce(volatile int *sdata, int tid) {
    // volatile keeps the compiler from caching sdata in registers
    // between these warp-synchronous steps
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

Reduction 5: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems:
- still have iterations
- loop overhead

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduction 6: Complete Unrolling

If the number of iterations is known at compile time, the reduction can be unrolled completely.
- The block size is limited to 512 or 1024 threads
- The block size is assumed to be a power of 2
- For a fixed block size, complete unrolling is easy
- For a generic implementation:
  - CUDA supports C++ template parameters on device and host functions
  - the block size can be specified as a function template parameter

Reduction 6: Kernel
Specify the block size as a function template parameter; the conditions on blockSize below are resolved at compile time, so untaken branches are removed entirely.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // do reduction in shared memory
    if (blockSize >= 512) {
        if (tid < 256)
            sdata[tid] += sdata[tid + 256];
        __syncthreads();
    }

Reduction 6: Kernel
    if (blockSize >= 256) {
        if (tid < 128)
            sdata[tid] += sdata[tid + 128];
        __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64)
            sdata[tid] += sdata[tid + 64];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce<blockSize>(sdata, tid);

    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 6: Kernel

Modified warpReduce function:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, int tid)
{
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
}
Reduction 6: Invoking Template Kernels
Use a switch statement over the possible block sizes when invoking the templated kernel:

switch (threadsPerBlock) {
    case 512: reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case 256: reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case 128: reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  64: reduce6<64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  32: reduce6<32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  16: reduce6<16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   8: reduce6<8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   4: reduce6<4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   2: reduce6<2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   1: reduce6<1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
}
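A possible host-side setup for these launch parameters (a sketch; the names dimGrid, dimBlock, smemSize and threadsPerBlock follow the slide, while the sizing policy itself is an assumption):

int threadsPerBlock = 256;                        // must be a power of two (<= 1024)
// reduce6 reads two elements per thread during the load phase,
// so n is assumed to be a multiple of 2 * threadsPerBlock
int numBlocks = n / (threadsPerBlock * 2);
dim3 dimBlock(threadsPerBlock);
dim3 dimGrid(numBlocks);
size_t smemSize = threadsPerBlock * sizeof(int);  // dynamic shared memory per block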

Reduction 6: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

- Algorithm cascading can lead to significant speedups in practice

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduce 6    0.00769     34.9014

Reduction 7: Multiple Adds / Thread

Algorithm cascading:
- Combine sequential and parallel reduction
- Each thread loads and sums multiple elements into shared memory
- Then do the tree-based reduction in shared memory
- Replace the load-and-add of two elements
- with a loop that loads and adds as many elements as necessary

Reduction 7: Kernel
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        // assumes n is a multiple of blockSize * 2; otherwise guard the second load
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();
    // do reduction in shared mem
    ...
    // write result for this block to global mem
    ...
}
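A possible launch for the cascaded kernel (a sketch, not from the slides): the block count is kept small and fixed, and each thread then covers many elements through the gridSize-strided while loop above.

int threads = 256;
int blocks  = 64;    // just enough blocks to fill the GPU; tune per device
reduce7<256><<<blocks, threads, threads * sizeof(int)>>>(d_idata, d_odata, n);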
Reduction 7: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduce 6    0.00769     34.9014
Reduce 7    0.00277     96.8672

Reduction Performance Comparison w.r.t. Time on the Tesla K40m GPU

(Figure: execution time in ms vs. reduction level 0-6, one curve per data size n = 2^19 ... 2^27.)
Reduction Performance Comparison w.r.t. Bandwidth on the Tesla K40m GPU

(Figure: achieved bandwidth in GB/s vs. reduction level 0-6, one curve per data size n = 2^19 ... 2^27.)
Types of Optimization

- Algorithmic optimizations
  - changes to addressing, algorithm cascading (Reductions 1 to 4, and 7)
  - approx. 12x speedup
- Code optimizations
  - loop unrolling (Reductions 5 and 6)
  - approx. 3x speedup

Summary

Kernel     Optimization
Reduce1    Interleaved addressing (using modulo arithmetic) with divergent branching
Reduce2    Interleaved addressing (using contiguous threads) with bank conflicts
Reduce3    Sequential addressing, no divergence or bank conflicts
Reduce4    Uses n/2 threads, performs the first level of the reduction during the global load
Reduce5    Unrolled loop for the last warp, intra-warp synchronisation barriers removed
Reduce6    Completely unrolled, using a template parameter to assert that the number of threads is a power of two
Reduce7    Multiple elements per thread, small constant number of thread blocks launched; requires very few synchronisation barriers
Example Applications of Reduction

- Bitonic Sort
- Prefix Sum

Problem: Sorting

- Sort any random permutation of numbers in ascending or descending order
- Basic introduction to sorting networks
- Focus on a comparison-based sort: Bitonic Sort
- Discuss how its operations can be parallelized using CUDA

Sorting Networks

A sorting network is composed of two elements:
- Wires: wires run from left to right, carrying values (one per wire) that traverse the network all at the same time.
- Comparators: comparators connect two wires. When a pair of values, traveling through a pair of wires, encounters a comparator, the comparator may or may not swap the values.

Comparator

Ascending (TRUE):   inputs a, b  ->  top wire carries min(a,b), bottom wire carries max(a,b)
Descending (FALSE): inputs a, b  ->  top wire carries max(a,b), bottom wire carries min(a,b)

Figure: Comparator Function

A Simple Sorting Network

Sort four numbers a, b, c, d in ascending order, where a > b > c > d.

(Figure: Sorting Four Numbers; a four-wire comparator network after which the wires carry d, c, b, a from top to bottom, i.e. the ascending order.)

Bubble Sort

Any comparison-based sort can be expressed as a sorting network.


Optimizing Reduction Kernels Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Bitonic Sort

Bitonic sort takes place using two fundamental steps:
- Step I: convert an arbitrary sequence into a bitonic sequence.
- Step II: convert a bitonic sequence into a sorted sequence.
A bitonic sequence is a sequence of numbers that is first strictly increasing and then, after a point, strictly decreasing: a1 < a2 < ... < am > b1 > b2 > ... > bn.

Bitonic Sort

Input   Step I (build a bitonic sequence)   Step II (sort the bitonic sequence)
  7        7    7    7                          7    3    2
  9        9    8    8                          4    2    3
  8       11   11    9                          3    7    4
 11        8    9   11                          2    4    7
 12        3    4   12                         12    9    8
  3       12   12    4                          8    8    9
  4        4    3    3                          9   12   11
  2        2    2    2                         11   11   12

(Each column is the array after one stage of comparators; Step I turns the input into the bitonic sequence 7 8 9 11 12 4 3 2, which Step II then sorts.)

Recursive Structure

- If you look closely, Step I uses Step II recursively on smaller sequences.
- Step II can be used to sort in either order (ascending or descending); the order is controlled through the comparator direction.
- Step I uses Step II to construct subsequences that are bitonic in nature.
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm

Recursive C Program

// Comparator; 'a' is the global array being sorted and 'exchange' swaps two of its entries
void compare(int i, int j, bool dir) {
    if (dir == (a[i] > a[j]))
        exchange(i, j);
}

// Step II
void bitonicMerge(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare(i, i + m, dir);
        bitonicMerge(lo, m, dir);
        bitonicMerge(lo + m, m, dir);
    }
}
Recursive C Program

// Step I
void bitonicSort(int lo, int n, bool dir) {
    if (n > 1)
    {
        int m = n / 2;
        bitonicSort(lo, m, ASCENDING);
        bitonicSort(lo + m, m, DESCENDING);
        bitonicMerge(lo, n, dir);
    }
}
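A minimal driver sketch showing how the pieces the slides leave implicit fit together (the global array a, exchange and the direction constants are assumptions; in a single C file these definitions would appear above compare and bitonicMerge):

#include <stdbool.h>

#define ASCENDING  true
#define DESCENDING false

int a[8] = {7, 9, 8, 11, 12, 3, 4, 2};   // length must be a power of two

void exchange(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

int main(void) {
    bitonicSort(0, 8, ASCENDING);        // sorts a[0..7] in ascending order
    return 0;
}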

Scope for Parallelization

(Figure: the same 8-element bitonic sorting network as above; all comparators within a single column operate on disjoint pairs of wires and can therefore be executed in parallel.)
Formulate Parallel Solution

- Associate every CUDA thread block with a sorting subproblem.
- Merge the per-block results to solve the original sorting problem.

Mapping Sorting Subproblem

(Figure: one thread block is assigned an eight-element subarray of the input; 4 threads handle the eight elements, since each thread processes two elements per step.)

Shared memory size per block = 2 * number of threads
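A launch configuration consistent with this mapping (a sketch; arrayLength is an assumed host-side variable, and SHARED_SIZE is the per-block tile size used by the kernel on the following slides):

// Each block sorts SHARED_SIZE keys using SHARED_SIZE / 2 threads,
// so arrayLength / SHARED_SIZE blocks cover the whole input.
bitonicSortShared1<<<arrayLength / SHARED_SIZE, SHARED_SIZE / 2>>>(
    d_DstKey, d_DstVal, d_SrcKey, d_SrcVal);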
Comparator

__device__ inline void Comparator(uint &keyA, uint &valA, uint &keyB, uint &valB,
                                  uint dir) {
    uint t;
    if ((keyA > keyB) == dir) {
        t = keyA; keyA = keyB; keyB = t;
        t = valA; valA = valB; valB = t;
    }
}

(From the NVIDIA CUDA SDK benchmark suite.)


Sort in Shared Memory

__global__ void bitonicSortShared1(uint *d_DstKey, uint *d_DstVal,
                                   uint *d_SrcKey, uint *d_SrcVal) {
    // Shared memory storage for the current subarray
    __shared__ uint s_key[SHARED_SIZE];
    __shared__ uint s_val[SHARED_SIZE];

    // Offset to the beginning of the subarray and load data
    d_SrcKey += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_SrcVal += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_DstKey += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_DstVal += blockIdx.x * SHARED_SIZE + threadIdx.x;

    s_key[threadIdx.x + 0] = d_SrcKey[0];
    s_val[threadIdx.x + 0] = d_SrcVal[0];
    s_key[threadIdx.x + SHARED_SIZE / 2] = d_SrcKey[SHARED_SIZE / 2];
    s_val[threadIdx.x + SHARED_SIZE / 2] = d_SrcVal[SHARED_SIZE / 2];
    // strided load by threads: each thread loads two elements

    for (uint size = 2; size < SHARED_SIZE; size <<= 1) {
        // Bitonic merge
        uint ddd = (threadIdx.x & (size / 2)) != 0;
        for (uint stride = size / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
            Comparator(s_key[pos + 0], s_val[pos + 0],
                       s_key[pos + stride], s_val[pos + stride], ddd);
        }
    }

Important Variables

- size is the length of the bitonic sequences being generated.
- pos is the position of the first item processed by a thread; it depends on the thread id and the stride.
- stride is the distance between the positions of the two items compared by a thread.
- ddd is the sort direction and depends on the thread id and size.

Bitonic Sequence Creation
(Figure: the first passes of the outer loop on the 8-element example, first size = 2 with stride = 1, then size = 4 with stride = 2 followed by stride = 1. Thread ids 0-3 are marked next to the comparators they execute; each thread derives its direction as ddd = ((threadIdx.x & (size/2)) != 0) and its position as pos = 2*threadIdx.x - (threadIdx.x & (stride - 1)).)
    // sort in opposite directions for odd / even block ids
    uint ddd = (blockIdx.x + 1) & 1;
    for (uint stride = SHARED_SIZE / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
        Comparator(s_key[pos + 0], s_val[pos + 0],
                   s_key[pos + stride], s_val[pos + stride], ddd);
    }
    __syncthreads();
    d_DstKey[0] = s_key[threadIdx.x + 0];
    d_DstVal[0] = s_val[threadIdx.x + 0];
    d_DstKey[SHARED_SIZE / 2] = s_key[threadIdx.x + SHARED_SIZE / 2];
    d_DstVal[SHARED_SIZE / 2] = s_val[threadIdx.x + SHARED_SIZE / 2];
}

Sorting Bitonic Sequence
For block 0, ddd = (0 + 1) & 1 = 1 (ascending):

Bitonic input   stride=4   stride=2   stride=1
      7             7          3          2
      8             4          2          3
      9             3          7          4
     11             2          4          7
     12            12          9          8
      4             8          8          9
      3             9         12         11
      2            11         11         12
Example Applications of Reduction

- Bitonic Sort
- Prefix Sum

All-Prefix-Sums

The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an−1],
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)].

Example: if ⊕ is addition, then the all-prefix-sums operation on the array
[3 1 7 0 4 1 6 3]
returns
[3 4 11 11 15 16 22 25].
Inclusive and Exclusive scan

All-prefix-sums on an array of data is commonly known as scan.
- Inclusive scan: each element j of the result is the sum of all input elements up to and including j.
- Exclusive scan: each element j of the result is the sum of all input elements up to but excluding j.
The exclusive scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., an−1],
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)].
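For example, with addition (identity 0) the exclusive scan of [3 1 7 0 4 1 6 3] is [0 3 4 11 11 15 16 22]: the inclusive result shifted right by one position, with the identity placed in front.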
Example - Inclusive Scan
Array indexes:   0  1  2  3  4  5  6  7
Input array:     3  1  7  0  4  1  6  3

The output is built one element at a time; each step adds the next input element to the previous partial sum:

3  1  7  0  4  1  6  3
3  4  7  0  4  1  6  3
3  4 11  0  4  1  6  3
3  4 11 11  4  1  6  3
3  4 11 11 15  1  6  3
3  4 11 11 15 16  6  3
3  4 11 11 15 16 22  3
3  4 11 11 15 16 22 25   (output array)

Sequential Code

Inclusive scan:

void scan(float *output, float *input, int length)
{
    output[0] = input[0];   // since this is an inclusive scan
    for (int j = 1; j < length; ++j)
        output[j] = output[j - 1] + input[j];
}

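For comparison with the CUDA kernels later in this deck, which compute an exclusive scan, the sequential exclusive variant is a small change (a sketch, not from the slides):

void exclusive_scan(float *output, float *input, int length)
{
    float running = 0.0f;               // identity of +
    for (int j = 0; j < length; ++j) {
        output[j] = running;            // element j excludes input[j]
        running  += input[j];
    }
}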
Sequential Code - Complexity

- The code performs exactly n − 1 adds for an array of length n
- Work complexity is O(n)
- Very large n motivates parallel execution

Parallel Prefix Sum - Hillis/Steele Scan (Algorithm 1)
for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[k] := x[k − 2^(d−1)] + x[k]

Example - Hillis/Steele Inclusive Scan

Input:                                 1  2  3  4  5  6  7  8
Step I   (add the element 2^0 away):   1  3  5  7  9 11 13 15
Step II  (add the element 2^1 away):   1  3  6 10 14 18 22 26
Step III (add the element 2^2 away):   1  3  6 10 15 21 28 36

Number of steps: O(log n); total work: O(n log n).
Analysis

- The algorithm performs O(n log2 n) addition operations.
- Remember that a sequential scan performs O(n) adds; this naïve implementation is therefore not work-efficient.
- Algorithm 1 also assumes that there are as many processors as data elements. On a GPU running CUDA this is usually not the case: the forall is divided into smaller parallel batches (warps) that are executed sequentially on a multiprocessor.
- As written, Algorithm 1 therefore does not work, because it performs the scan in place on the array; results written by one warp can be overwritten before they are read by threads of another warp.
A double-buffered version of the sum scan from Algorithm 1

Algorithm 2
for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[out][k] := x[in][k − 2^(d−1)] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)

CUDA C code - Algorithm 1
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // For an exclusive scan, shift right by one and set the first element to 0
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;             // swap double-buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];
}
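A possible single-block launch for this kernel (a sketch; d_out, d_in and n are assumed host-side names, with n a power of two no larger than the maximum block size). The double buffer needs room for 2 * n floats of dynamic shared memory:

scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);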

Parallel Prefix Sum - Blelloch Scan

- The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum.
- A binary tree with n leaves has log2 n levels, and each level d ∈ [0, log2 n) has 2^d nodes.
- If we perform one add per node, then a single traversal of the tree performs O(n) adds.

Parallel Prefix Sum - Blelloch Scan

The algorithm consists of two phases:
- Reduce (up-sweep) phase: traverse the tree from the leaves to the root, computing partial sums at the internal nodes.
- Down-sweep phase: traverse the tree back down from the root, using the partial sums computed in the reduce phase to build the scan in place on the array.

Blelloch Scan - Reduce Phase

for d := 0 to log2 n − 1 do
    for k from 0 to n − 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) − 1] := x[k + 2^d − 1] + x[k + 2^(d+1) − 1]

(Figure: the reduce/up-sweep phase on a balanced binary tree; partial sums propagate from the leaves up to the root.)
Blelloch Scan - Down-Sweep Phase

x[n − 1] := 0
for d := log2 n − 1 down to 0 do
    for k from 0 to n − 1 by 2^(d+1) in parallel do
        t := x[k + 2^d − 1]
        x[k + 2^d − 1] := x[k + 2^(d+1) − 1]
        x[k + 2^(d+1) − 1] := t + x[k + 2^(d+1) − 1]

(Figure: the down-sweep phase on the same tree; each internal node passes its own value to its left child and the sum of its value and the stored left-subtree sum to its right child.)
Example - Blelloch Exclusive Scan

Input:                          1  2  3  4  5  6  7  8
After reduce step 0:            1  3  3  7  5 11  7 15
After reduce step 1:            1  3  3 10  5 11  7 26
After reduce step 2:            1  3  3 10  5 11  7 36
Set the last element to 0:      1  3  3 10  5 11  7  0
Down-sweep step 0:              1  3  3  0  5 11  7 10
Down-sweep step 1:              1  0  3  3  5 10  7 21
Down-sweep step 2 (result):     0  1  3  6 10 15 21 28

Cuda code
CUDA code

__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;

    temp[2 * thid]     = g_idata[2 * thid];      // load input into shared memory
    temp[2 * thid + 1] = g_idata[2 * thid + 1];
    for (int d = n >> 1; d > 0; d >>= 1)         // build sums in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

Cuda code
    if (thid == 0)
        temp[n - 1] = 0;                         // clear the last element
    for (int d = 1; d < n; d *= 2)               // traverse down the tree & build the scan
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2 * thid]     = temp[2 * thid];      // write results to device memory
    g_odata[2 * thid + 1] = temp[2 * thid + 1];
}
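A possible launch for a single block of n elements (a sketch; d_odata, d_idata and n are assumed host-side names, with n a power of two small enough that n/2 threads and n floats of shared memory fit in one block):

prescan<<<1, n / 2, n * sizeof(float)>>>(d_odata, d_idata, n);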

Reduction Conclusion

We have learnt how to:
- understand CUDA performance characteristics:
  - memory coalescing
  - divergent branching
  - bank conflicts
  - latency hiding
- use peak performance metrics to guide optimization
- understand parallel algorithm complexity
- identify the type of bottleneck and optimize the algorithm accordingly
References

1. Mark Harris, "Optimizing Parallel Reduction in CUDA". White paper available at http://docs.nvidia.com.
2. David Tarjan, "Reductions and Low-Level Performance Considerations".
3. Mark Harris, "Parallel Prefix Sum (Scan) with CUDA". White paper available at https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
