
Optimizing Reduction Kernels

Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur

December 23, 2019

Course Organization
Topic                                            Week    Hours
Review of basic COA w.r.t. performance           1       2
Intro to GPU architectures                       2       3
Intro to CUDA programming                        3       2
Multi-dimensional data and synchronization       4       2
Warp Scheduling and Divergence                   5       2
Memory Access Coalescing                         6       2
Optimizing Reduction Kernels                     7       3
Kernel Fusion, Thread and Block Coarsening       8       3
OpenCL - runtime system                          9       3
OpenCL - heterogeneous computing                 10      2
Efficient Neural Network Training/Inferencing    11-12   6
Recap

- The Host-Kernel Model for CPU-GPU Systems
- The CUDA programming language
- Mapping multi-dimensional kernels to multi-dimensional data

Recap

- Querying device properties
- The concept of scheduling warps
- Performance bottlenecks:
  - branch divergence
  - global memory accesses

Parallel Patterns

- Matrix Multiplication (gather operation)
- Convolution (stencil operation)
- Reduction

Reduction Algorithm

- Reduce a vector to a single value via an associative operator
- Examples: sum, min, max, AND, OR, etc. (an average is a sum reduction followed by one division)
- Visits every element in the array
- Large arrays motivate parallel execution of the reduction
- Not compute bound but memory bound

Serial and Parallel Implementation
A sequential version:
- O(n) work
- for (int i = 0; i < n; ++i) ...

A parallel version:
- O(log2 n) steps
- "tree"-based implementation

(Figure: one thread accumulating the array serially vs. two threads combining partial sums in a tree to produce the total sum.)
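For concreteness, the sequential O(n) version is just an accumulator loop (a sketch; the array a and its length n are assumed):

int sum = 0;
for (int i = 0; i < n; ++i)
    sum += a[i];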
Parallel Reduction Algorithm

To process very large arrays:
- multiple thread blocks are required
- each block reduces a portion of the array
- partial results must be communicated between blocks
- this requires global synchronization

Problem:
- CUDA does not support global synchronization across thread blocks within a kernel launch

Solution:
- kernel decomposition
Kernel Decomposition

- Decompose the computation into multiple kernel invocations
- A kernel launch serves as a global synchronization point
- Negligible hardware overhead, low software overhead

Figure: Multiple Kernel Invocations (from 'Optimizing Parallel Reduction in CUDA' by Mark Harris)

Optimization In Reduction

- Metrics for GPU performance:
  - GFLOP/s for compute-bound kernels (one billion floating-point operations per second)
  - Bandwidth for memory-bound kernels (the rate at which data can be read from or written to memory by the processor)
- Reduction has very low arithmetic intensity: roughly 1 flop per element loaded
- So strive for peak memory bandwidth

Reduction 1: Interleaved Addressing

- Each thread loads one element from global memory into shared memory
- In each step a thread adds two elements
- Half of the remaining threads are deactivated at the end of each step

Figure: Reduction with Interleaved Addressing and a Divergent Branch

Reduction 1: Kernel
__global__ void reduce1(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // modulo arithmetic is slow!
        if ((tid % (2 * s)) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 1: Host

- The kernel computes one partial sum per thread block
- Each block writes its partial sum to g_odata[blockIdx.x], so the partial results occupy the first gridDim.x elements of the output array in global memory
- A final addition over this reduced data set is still needed
- It is done by launching the same kernel again on the partial results

Reduction 1: Host Code for Multiple Kernel Launch

... // cudaMemcpy HostToDevice ...
int threadsPerBlock = 64;
int old_blocks, blocks = (N / threadsPerBlock) / 2;
blocks = (blocks == 0) ? 1 : blocks;
old_blocks = blocks;
while (blocks > 0)   // call compute kernel repeatedly
{
    sum<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(devPtrA);
    old_blocks = blocks;
    blocks = (blocks / threadsPerBlock) / 2;
}
if (blocks == 0 && old_blocks != 1)   // final kernel call, if still needed
    sum<<<1, old_blocks / 2, (old_blocks / 2) * sizeof(int)>>>(devPtrA);
... // cudaMemcpy DeviceToHost ...
Reduction 1: Analysis

Interleaved addressing with divergent branching

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems:
- highly divergent branching makes the warps very inefficient
- half of the threads do nothing after the first step
- the % operator is very slow
- the loop itself is expensive

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
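As a quick check of the bandwidth column (the arithmetic is added here, it is not on the slide): Reduce 1 reads 2^26 x 4 B = 268,435,456 bytes in 0.03276 s, i.e. roughly 0.268 GB / 0.03276 s ≈ 8.2 GB/s, matching the reported figure.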
Reduction 2: Interleaved Addressing

- Replace the divergent branch in the inner loop
- with a strided index and a non-divergent branch
- New problem: shared memory bank conflicts

Figure: Interleaved Addressing Replacing the Divergent Branch

Reduction 2: Kernel
__global__ void reduce2(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 2: Analysis
Interleaved addressing (using contiguous threads) with bank conflicts

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- shared memory bank conflicts (new)
- half of the threads are idle after the first step
- the loop is still expensive
(The divergent branch and the % operator from Reduce 1 are eliminated.)

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Shared Memory Bank Conflict

- Shared memory is divided into banks, and each bank can serve only one read/write access at a time
- If more than one thread attempts to access the same bank at the same time, the accesses are serialized (a bank conflict)
- The hardware splits such a memory request into conflict-free parts, decreasing the effective bandwidth
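A minimal kernel sketch of the difference, assuming the usual configuration of 32 banks of 4-byte words (bank = word index mod 32); the indices are illustrative only and not taken from the slides:

__global__ void bank_conflict_demo(int *out) {
    __shared__ int s[64];
    s[threadIdx.x]      = threadIdx.x;   // conflict-free store: consecutive threads, consecutive banks
    s[threadIdx.x + 32] = threadIdx.x;
    __syncthreads();
    // strided access: threads t and t+16 both map to bank (2*t) % 32, a 2-way conflict
    int strided    = s[2 * threadIdx.x];
    // sequential access: thread t maps to bank t % 32, conflict-free
    int sequential = s[threadIdx.x];
    out[threadIdx.x] = strided + sequential;
}

Launched with a single warp of 32 threads, the strided read is serialized into two transactions, while the sequential read completes in one.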

Reduction 3: Sequential Addressing

- Replace the strided indexing in the inner loop
- with a reversed loop and threadID-based indexing
- New problem: idle threads on the first loop iteration

Figure: Reduction with Sequential Addressing


Reduction 3: Kernel
__global__ void reduce3(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 3: Analysis
Sequential addressing: no divergence and no bank conflicts

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- half of the threads are idle from the first loop iteration
- the loop is still expensive

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduction 4: First Add During Load

- Keep all threads busy in the first step
- Halve the number of blocks
- Replace the single load with two loads
- The load into shared memory itself performs the first level of the reduction

Figure: Reduction with First Add During Load


Reduction 4: Kernel
__global__ void reduce4(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 4: Analysis
Memory bandwidth is still underutilized

Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems remaining:
- loop overhead
- another likely bottleneck is instruction overhead

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduction 5: Unrolling the Last Warp

- The number of active threads decreases with each iteration
- When s <= 32, only one warp is left
- A warp executes the same instruction in lockstep (SIMD)
- That means when s <= 32:
  - "__syncthreads()" is not needed
  - "if (tid < s)" is not needed
- So unroll the last 6 iterations

Without unrolling, all warps execute every iteration of the for loop and the if statement.
Reduction 5: Kernel
__global__ void reduce5(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce(sdata, tid);
    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
Reduction 5: warpReduce

__device__ void warpReduce(volatile int *sdata, int tid) {
    // volatile keeps the compiler from caching sdata in registers
    // between these warp-synchronous steps
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

Reduction 5: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Problems:
- still have iterations
- loop overhead

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduction 6: Complete Unrolling

If the number of iterations is known at compile time, the reduction can be unrolled completely.
- The block size is limited to 512 or 1024 threads
- The block size is assumed to be a power of 2
- For a fixed block size, complete unrolling is easy
- For a generic implementation:
  - CUDA supports C++ template parameters on device and host functions
  - the block size can be specified as a function template parameter

Reduction 6: Kernel
Specify the block size as a function template parameter; the conditions on blockSize below are resolved at compile time, so untaken branches are removed entirely.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // do reduction in shared memory
    if (blockSize >= 512) {
        if (tid < 256)
            sdata[tid] += sdata[tid + 256];
        __syncthreads();
    }

Reduction 6: Kernel
    if (blockSize >= 256) {
        if (tid < 128)
            sdata[tid] += sdata[tid + 128];
        __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64)
            sdata[tid] += sdata[tid + 64];
        __syncthreads();
    }
    if (tid < 32)
        warpReduce<blockSize>(sdata, tid);

    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

Reduction 6: Kernel

Modified warpReduce function:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, int tid)
{
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
}
Reduction 6: Invoking Template Kernels
Use a switch statement over the possible block sizes when invoking the templated kernel:

switch (threadsPerBlock) {
    case 512: reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case 256: reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case 128: reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  64: reduce6<64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  32: reduce6<32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case  16: reduce6<16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   8: reduce6<8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   4: reduce6<4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   2: reduce6<2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
    case   1: reduce6<1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n); break;
}
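A possible host-side setup for these launch parameters (a sketch; the names dimGrid, dimBlock, smemSize and threadsPerBlock follow the slide, while the sizing policy itself is an assumption):

int threadsPerBlock = 256;                        // must be a power of two (<= 1024)
// reduce6 reads two elements per thread during the load phase,
// so n is assumed to be a multiple of 2 * threadsPerBlock
int numBlocks = n / (threadsPerBlock * 2);
dim3 dimBlock(threadsPerBlock);
dim3 dimGrid(numBlocks);
size_t smemSize = threadsPerBlock * sizeof(int);  // dynamic shared memory per block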

Reduction 6: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

- Algorithm cascading can lead to significant speedups in practice

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduce 6    0.00769     34.9014

Reduction 7: Multiple Adds / Thread

Algorithm cascading:
- Combine sequential and parallel reduction
- Each thread loads and sums multiple elements into shared memory
- Then do the tree-based reduction in shared memory
- Replace the load-and-add of two elements
- with a loop that loads and adds as many elements as necessary

Reduction 7: Kernel
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        // assumes n is a multiple of blockSize * 2; otherwise guard the second load
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();
    // do reduction in shared mem
    ...
    // write result for this block to global mem
    ...
}
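A possible launch for the cascaded kernel (a sketch, not from the slides): the block count is kept small and fixed, and each thread then covers many elements through the gridSize-strided while loop above.

int threads = 256;
int blocks  = 64;    // just enough blocks to fill the GPU; tune per device
reduce7<256><<<blocks, threads, threads * sizeof(int)>>>(d_idata, d_odata, n);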
Reduction 7: Analysis
Setup: array size 2^26, 1024 threads/block, GPU: Tesla K40m

Reduction   Time (s)    Bandwidth (GB/s)
Reduce 1    0.03276     8.1951
Reduce 2    0.02312     11.6117
Reduce 3    0.01939     13.839
Reduce 4    0.01104     24.3098
Reduce 5    0.00836     32.1053
Reduce 6    0.00769     34.9014
Reduce 7    0.00277     96.8672

Reduction Performance Comparison w.r.t. Time on the Tesla K40m GPU

(Figure: execution time in ms vs. reduction level 0-6, one curve per data size n = 2^19 ... 2^27.)
Reduction Performance Comparison w.r.t. Bandwidth on the Tesla K40m GPU

(Figure: achieved bandwidth in GB/s vs. reduction level 0-6, one curve per data size n = 2^19 ... 2^27.)
Types of Optimization

- Algorithmic optimizations
  - changes to addressing, algorithm cascading (Reductions 1 to 4, and 7)
  - approx. 12x speedup
- Code optimizations
  - loop unrolling (Reductions 5 and 6)
  - approx. 3x speedup

Summary

Kernel     Optimization
Reduce1    Interleaved addressing (using modulo arithmetic) with divergent branching
Reduce2    Interleaved addressing (using contiguous threads) with bank conflicts
Reduce3    Sequential addressing, no divergence or bank conflicts
Reduce4    Uses n/2 threads, performs the first level of the reduction during the global load
Reduce5    Unrolled loop for the last warp, intra-warp synchronisation barriers removed
Reduce6    Completely unrolled, using a template parameter to assert that the number of threads is a power of two
Reduce7    Multiple elements per thread, small constant number of thread blocks launched; requires very few synchronisation barriers
Example Applications of Reduction

- Bitonic Sort
- Prefix Sum

Problem: Sorting

- Sort any random permutation of numbers in ascending or descending order
- Basic introduction to sorting networks
- Focus on a comparison-based sort: Bitonic Sort
- Discuss how its operations can be parallelized using CUDA

Sorting Networks

A sorting network is composed of two elements:
- Wires: wires run from left to right, carrying values (one per wire) that traverse the network all at the same time.
- Comparators: comparators connect two wires. When a pair of values, traveling through a pair of wires, encounters a comparator, the comparator may or may not swap the values.

Comparator

Ascending (TRUE):   inputs a, b  ->  top wire carries min(a,b), bottom wire carries max(a,b)
Descending (FALSE): inputs a, b  ->  top wire carries max(a,b), bottom wire carries min(a,b)

Figure: Comparator Function

A Simple Sorting Network

Sort four numbers a, b, c, d in ascending order, where a > b > c > d.

(Figure: Sorting Four Numbers; a four-wire comparator network after which the wires carry d, c, b, a from top to bottom, i.e. the ascending order.)

Bubble Sort

Any comparison-based sort can be expressed as a sorting network.


Optimizing Reduction Kernels Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Bitonic Sort

Bitonic sort takes place using two fundamental steps:
- Step I: convert an arbitrary sequence into a bitonic sequence.
- Step II: convert a bitonic sequence into a sorted sequence.
A bitonic sequence is a sequence of numbers that is first strictly increasing and then, after a point, strictly decreasing: a1 < a2 < ... < am > b1 > b2 > ... > bn.

Bitonic Sort

Input   Step I (build a bitonic sequence)   Step II (sort the bitonic sequence)
  7        7    7    7                          7    3    2
  9        9    8    8                          4    2    3
  8       11   11    9                          3    7    4
 11        8    9   11                          2    4    7
 12        3    4   12                         12    9    8
  3       12   12    4                          8    8    9
  4        4    3    3                          9   12   11
  2        2    2    2                         11   11   12

(Each column is the array after one stage of comparators; Step I turns the input into the bitonic sequence 7 8 9 11 12 4 3 2, which Step II then sorts.)

Recursive Structure

- If you look closely, Step I uses Step II recursively on smaller sequences.
- Step II can be used to sort in either order (ascending or descending); the order is controlled through the comparator direction.
- Step I uses Step II to construct subsequences that are bitonic in nature.
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm

Recursive C Program

// Comparator; 'a' is the global array being sorted and 'exchange' swaps two of its entries
void compare(int i, int j, bool dir) {
    if (dir == (a[i] > a[j]))
        exchange(i, j);
}

// Step II
void bitonicMerge(int lo, int n, bool dir) {
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare(i, i + m, dir);
        bitonicMerge(lo, m, dir);
        bitonicMerge(lo + m, m, dir);
    }
}
Recursive C Program

// Step I
void bitonicSort(int lo, int n, bool dir) {
    if (n > 1)
    {
        int m = n / 2;
        bitonicSort(lo, m, ASCENDING);
        bitonicSort(lo + m, m, DESCENDING);
        bitonicMerge(lo, n, dir);
    }
}
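A minimal driver sketch showing how the pieces the slides leave implicit fit together (the global array a, exchange and the direction constants are assumptions; in a single C file these definitions would appear above compare and bitonicMerge):

#include <stdbool.h>

#define ASCENDING  true
#define DESCENDING false

int a[8] = {7, 9, 8, 11, 12, 3, 4, 2};   // length must be a power of two

void exchange(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

int main(void) {
    bitonicSort(0, 8, ASCENDING);        // sorts a[0..7] in ascending order
    return 0;
}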

Scope for Parallelization

(Figure: the same 8-element bitonic sorting network as above; all comparators within a single column operate on disjoint pairs of wires and can therefore be executed in parallel.)
Formulate Parallel Solution

- Associate every CUDA thread block with a sorting subproblem.
- Merge the per-block results to solve the original sorting problem.

Mapping Sorting Subproblem

(Figure: one thread block is assigned an eight-element subarray of the input; 4 threads handle the eight elements, since each thread processes two elements per step.)

Shared memory size per block = 2 * number of threads
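A launch configuration consistent with this mapping (a sketch; arrayLength is an assumed host-side variable, and SHARED_SIZE is the per-block tile size used by the kernel on the following slides):

// Each block sorts SHARED_SIZE keys using SHARED_SIZE / 2 threads,
// so arrayLength / SHARED_SIZE blocks cover the whole input.
bitonicSortShared1<<<arrayLength / SHARED_SIZE, SHARED_SIZE / 2>>>(
    d_DstKey, d_DstVal, d_SrcKey, d_SrcVal);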
Comparator

__device__ inline void Comparator(uint &keyA, uint &valA, uint &keyB, uint &valB,
                                  uint dir) {
    uint t;
    if ((keyA > keyB) == dir) {
        t = keyA; keyA = keyB; keyB = t;
        t = valA; valA = valB; valB = t;
    }
}

(From the NVIDIA CUDA SDK benchmark suite.)


Sort in Shared Memory

__global__ void bitonicSortShared1(uint *d_DstKey, uint *d_DstVal,
                                   uint *d_SrcKey, uint *d_SrcVal) {
    // Shared memory storage for the current subarray
    __shared__ uint s_key[SHARED_SIZE];
    __shared__ uint s_val[SHARED_SIZE];

    // Offset to the beginning of the subarray and load data
    d_SrcKey += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_SrcVal += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_DstKey += blockIdx.x * SHARED_SIZE + threadIdx.x;
    d_DstVal += blockIdx.x * SHARED_SIZE + threadIdx.x;

    s_key[threadIdx.x + 0] = d_SrcKey[0];
    s_val[threadIdx.x + 0] = d_SrcVal[0];
    s_key[threadIdx.x + SHARED_SIZE / 2] = d_SrcKey[SHARED_SIZE / 2];
    s_val[threadIdx.x + SHARED_SIZE / 2] = d_SrcVal[SHARED_SIZE / 2];
    // strided load by threads: each thread loads two elements

    for (uint size = 2; size < SHARED_SIZE; size <<= 1) {
        // Bitonic merge
        uint ddd = (threadIdx.x & (size / 2)) != 0;
        for (uint stride = size / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
            Comparator(s_key[pos + 0], s_val[pos + 0],
                       s_key[pos + stride], s_val[pos + stride], ddd);
        }
    }

Important Variables

- size is the length of the bitonic sequences being generated.
- pos is the position of the first item processed by a thread; it depends on the thread id and the stride.
- stride is the distance between the positions of the two items compared by a thread.
- ddd is the sort direction and depends on the thread id and size.

Bitonic Sequence Creation
(Figure: the first passes of the outer loop on the 8-element example, first size = 2 with stride = 1, then size = 4 with stride = 2 followed by stride = 1. Thread ids 0-3 are marked next to the comparators they execute; each thread derives its direction as ddd = ((threadIdx.x & (size/2)) != 0) and its position as pos = 2*threadIdx.x - (threadIdx.x & (stride - 1)).)
    // sort in opposite directions for odd / even block ids
    uint ddd = (blockIdx.x + 1) & 1;
    for (uint stride = SHARED_SIZE / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
        Comparator(s_key[pos + 0], s_val[pos + 0],
                   s_key[pos + stride], s_val[pos + stride], ddd);
    }
    __syncthreads();
    d_DstKey[0] = s_key[threadIdx.x + 0];
    d_DstVal[0] = s_val[threadIdx.x + 0];
    d_DstKey[SHARED_SIZE / 2] = s_key[threadIdx.x + SHARED_SIZE / 2];
    d_DstVal[SHARED_SIZE / 2] = s_val[threadIdx.x + SHARED_SIZE / 2];
}

Sorting Bitonic Sequence
For block 0, ddd = (0 + 1) & 1 = 1 (ascending):

Bitonic input   stride=4   stride=2   stride=1
      7             7          3          2
      8             4          2          3
      9             3          7          4
     11             2          4          7
     12            12          9          8
      4             8          8          9
      3             9         12         11
      2            11         11         12
Example Applications of Reduction

- Bitonic Sort
- Prefix Sum

All-Prefix-Sums

The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an−1],
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)].

Example: if ⊕ is addition, then the all-prefix-sums operation on the array
[3 1 7 0 4 1 6 3]
returns
[3 4 11 11 15 16 22 25].
Inclusive and Exclusive scan

All-prefix-sums on an array of data is commonly known as scan.
- Inclusive scan: each element j of the result is the sum of all input elements up to and including j.
- Exclusive scan: each element j of the result is the sum of all input elements up to but excluding j.
The exclusive scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., an−1],
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)].
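For example, with addition (identity 0) the exclusive scan of [3 1 7 0 4 1 6 3] is [0 3 4 11 11 15 16 22]: the inclusive result shifted right by one position, with the identity placed in front.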
Example - Inclusive Scan
Array indexes:   0  1  2  3  4  5  6  7
Input array:     3  1  7  0  4  1  6  3

The output is built one element at a time; each step adds the next input element to the previous partial sum:

3  1  7  0  4  1  6  3
3  4  7  0  4  1  6  3
3  4 11  0  4  1  6  3
3  4 11 11  4  1  6  3
3  4 11 11 15  1  6  3
3  4 11 11 15 16  6  3
3  4 11 11 15 16 22  3
3  4 11 11 15 16 22 25   (output array)

Sequential Code

Inclusive scan:

void scan(float *output, float *input, int length)
{
    output[0] = input[0];   // since this is an inclusive scan
    for (int j = 1; j < length; ++j)
        output[j] = output[j - 1] + input[j];
}

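For comparison with the CUDA kernels later in this deck, which compute an exclusive scan, the sequential exclusive variant is a small change (a sketch, not from the slides):

void exclusive_scan(float *output, float *input, int length)
{
    float running = 0.0f;               // identity of +
    for (int j = 0; j < length; ++j) {
        output[j] = running;            // element j excludes input[j]
        running  += input[j];
    }
}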
Sequential Code - Complexity

- The code performs exactly n − 1 adds for an array of length n
- Work complexity is O(n)
- Very large n motivates parallel execution

Parallel Prefix Sum - Hillis/Steele Scan (Algorithm 1)
for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[k] := x[k − 2^(d−1)] + x[k]

Example - Hillis/Steele Inclusive Scan

Input:                                 1  2  3  4  5  6  7  8
Step I   (add the element 2^0 away):   1  3  5  7  9 11 13 15
Step II  (add the element 2^1 away):   1  3  6 10 14 18 22 26
Step III (add the element 2^2 away):   1  3  6 10 15 21 28 36

Number of steps: O(log n); total work: O(n log n).
Analysis

- The algorithm performs O(n log2 n) addition operations.
- Remember that a sequential scan performs O(n) adds; this naïve implementation is therefore not work-efficient.
- Algorithm 1 also assumes that there are as many processors as data elements. On a GPU running CUDA this is usually not the case: the forall is divided into smaller parallel batches (warps) that are executed sequentially on a multiprocessor.
- As written, Algorithm 1 therefore does not work, because it performs the scan in place on the array; results written by one warp can be overwritten before they are read by threads of another warp.
A double-buffered version of the sum scan from Algorithm 1

Algorithm 2
for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d−1) then
            x[out][k] := x[in][k − 2^(d−1)] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)

CUDA C code - Algorithm 1
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // For an exclusive scan, shift right by one and set the first element to 0
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;             // swap double-buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];
}
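A possible single-block launch for this kernel (a sketch; d_out, d_in and n are assumed host-side names, with n a power of two no larger than the maximum block size). The double buffer needs room for 2 * n floats of dynamic shared memory:

scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);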

Parallel Prefix Sum - Blelloch Scan

- The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum.
- A binary tree with n leaves has log2 n levels, and each level d ∈ [0, log2 n) has 2^d nodes.
- If we perform one add per node, then a single traversal of the tree performs O(n) adds.

Parallel Prefix Sum - Blelloch Scan

The algorithm consists of two phases:
- Reduce (up-sweep) phase: traverse the tree from the leaves to the root, computing partial sums at the internal nodes.
- Down-sweep phase: traverse the tree back down from the root, using the partial sums computed in the reduce phase to build the scan in place on the array.

Blelloch Scan - Reduce Phase

for d := 0 to log2 n − 1 do
    for k from 0 to n − 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) − 1] := x[k + 2^d − 1] + x[k + 2^(d+1) − 1]

(Figure: the reduce/up-sweep phase on a balanced binary tree; partial sums propagate from the leaves up to the root.)
Blelloch Scan - Down-Sweep Phase

x[n − 1] := 0
for d := log2 n − 1 down to 0 do
    for k from 0 to n − 1 by 2^(d+1) in parallel do
        t := x[k + 2^d − 1]
        x[k + 2^d − 1] := x[k + 2^(d+1) − 1]
        x[k + 2^(d+1) − 1] := t + x[k + 2^(d+1) − 1]

(Figure: the down-sweep phase on the same tree; each internal node passes its own value to its left child and the sum of its value and the stored left-subtree sum to its right child.)
Example - Blelloch Exclusive Scan

Input:                          1  2  3  4  5  6  7  8
After reduce step 0:            1  3  3  7  5 11  7 15
After reduce step 1:            1  3  3 10  5 11  7 26
After reduce step 2:            1  3  3 10  5 11  7 36
Set the last element to 0:      1  3  3 10  5 11  7  0
Down-sweep step 0:              1  3  3  0  5 11  7 10
Down-sweep step 1:              1  0  3  3  5 10  7 21
Down-sweep step 2 (result):     0  1  3  6 10 15 21 28

Cuda code
CUDA code

__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;

    temp[2 * thid]     = g_idata[2 * thid];      // load input into shared memory
    temp[2 * thid + 1] = g_idata[2 * thid + 1];
    for (int d = n >> 1; d > 0; d >>= 1)         // build sums in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

Cuda code
    if (thid == 0)
        temp[n - 1] = 0;                         // clear the last element
    for (int d = 1; d < n; d *= 2)               // traverse down the tree & build the scan
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2 * thid]     = temp[2 * thid];      // write results to device memory
    g_odata[2 * thid + 1] = temp[2 * thid + 1];
}
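A possible launch for a single block of n elements (a sketch; d_odata, d_idata and n are assumed host-side names, with n a power of two small enough that n/2 threads and n floats of shared memory fit in one block):

prescan<<<1, n / 2, n * sizeof(float)>>>(d_odata, d_idata, n);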

Reduction Conclusion

We have learnt how to:
- understand CUDA performance characteristics:
  - memory coalescing
  - divergent branching
  - bank conflicts
  - latency hiding
- use peak performance metrics to guide optimization
- understand parallel algorithm complexity
- identify the type of bottleneck and optimize the algorithm accordingly
References

1. Mark Harris, "Optimizing Parallel Reduction in CUDA". White paper available at http://docs.nvidia.com.
2. David Tarjan, "Reductions and Low-Level Performance Considerations".
3. Mark Harris, "Parallel Prefix Sum (Scan) with CUDA". White paper available at https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
