
Parallel Programming

Parallel Execution in CUDA


(Part 2)

Phạm Trọng Nghĩa


[email protected]
Overview
Apply knowledge of parallel execution in CUDA to
write a fast CUDA program doing “reduction”
• The “reduction” task
• Sequential implementation
• Parallel implementation
• Kernel function – 1st version
• Kernel function – 2nd version: reduce warp divergence
• Kernel function – 3rd version: reduce warp divergence + …

2
The “reduction” task
Input: an array in of n numbers
Output: sum (or product, max, min, …) of these numbers

Example: n = 8
in = [1, 9, 5, 1, 6, 4, 7, 2]
Reduce (sum) → 35
3
Reduction algorithm
• Sequential sum reduction:

sum = 0;
for (int i = 0; i < N; i++) {
    sum += input[i];
}

• General form of a sequential reduction:

acc = IDENTITY;
for (int i = 0; i < N; i++) {
    acc = Operator(acc, input[i]);
}
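As a concrete instance of the general form (a small sketch, not from the slides), a max reduction uses IDENTITY = INT_MIN and Operator = max:

#include <limits.h>

// Sequential max reduction: a sketch of the general form above,
// with IDENTITY = INT_MIN and Operator = max.
int reduceMaxOnHost(const int *input, int N)
{
    int acc = INT_MIN;                               // IDENTITY for max
    for (int i = 0; i < N; i++)
        acc = (input[i] > acc) ? input[i] : acc;     // acc = max(acc, input[i])
    return acc;
}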

4
Sequential implementation
int reduceOnHost(int *in, int n)
{
    int s = in[0];
    for (int i = 1; i < n; i++)
        s += in[i];
    return s;
}

Example (in = [1, 9, 5, 1, 6, 4, 7, 2]): one plus per time step, giving the
running sums 10, 15, 16, 22, 26, 33, 35 after time steps 1 to 7.

Time (# time steps): 7 = n-1 = O(n)
Work (# pluses): 7 = n-1 = O(n)
5
Parallel implementation – idea
in = [1, 9, 5, 1, 6, 4, 7, 2]
Time step 1: 4 additions in parallel  →  10  6  10  9
Time step 2: 2 additions in parallel  →  16  19
Time step 3: 1 addition               →  35

Time: ?
Work: ?

6
Parallel implementation – idea

• For N input values, the reduction tree performs
  • N/2 + N/4 + N/8 + … + 1 = N - 1 operations
  • in log2(N) steps – 1,000,000 input values take 20 steps
  • assuming that we have enough execution resources
• Average parallelism: (N - 1) / log2(N)
  • For N = 1,000,000, the average parallelism is about 50,000
  • However, the peak resource requirement is 500,000
  • This is not resource efficient (see the small check after this list)
• This is a work-efficient parallel algorithm
  • The amount of work done is comparable to an efficient sequential algorithm
  • Many parallel algorithms are not work efficient
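A small sketch (not part of the original slides) that re-derives these numbers for N = 1,000,000:

#include <stdio.h>
#include <math.h>

int main()
{
    // Re-check the reduction-tree numbers quoted above for N = 1,000,000
    double N = 1e6;
    double steps = ceil(log2(N));       // ≈ 20 steps
    double work = N - 1;                // ≈ 999,999 operations
    double avg = work / steps;          // ≈ 50,000 average parallelism
    double peak = N / 2;                // 500,000 operations in the first step
    printf("steps=%.0f work=%.0f avg=%.0f peak=%.0f\n", steps, work, avg, peak);
    return 0;
}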
7
Reduction trees
• The order of performing the operations is changed (sequential ➔ parallel)
  • The operator must be associative
  • Sequential:
    ((((((3 max 1) max 7) max 0) max 4) max 1) max 6) max 3
  • Parallel:
    ((3 max 1) max (7 max 0)) max ((4 max 1) max (6 max 3))
• We may also need to rearrange the order of the operands
  • The operator must then also be commutative

8
Parallel Reduction in Real life
• Sports & Competitions: Max reduction

• Also used to process large input data sets (Google and Hadoop MapReduce
frameworks)
  • There is no required order of processing the elements in a data set
    (associative and commutative)
  • Partition the data set into smaller chunks
  • Have each thread process a chunk (a sketch is shown after this list)
  • Use a reduction tree to summarize the results from each chunk into the
    final answer
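A minimal sketch of the chunk-per-thread idea (not from the slides; the kernel name and the perThreadSum buffer are illustrative): each thread sums its own chunk sequentially, and the per-thread results are then combined with a reduction tree such as the kernels later in this lecture.

// Sketch: each thread reduces one chunk of the input sequentially.
// perThreadSum must have one slot per launched thread; the partial sums
// are combined afterwards with a reduction tree.
__global__ void sumChunksKernel(const int *in, int *perThreadSum, int n, int chunkSize)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int start = t * chunkSize;
    int s = 0;
    for (int j = start; j < start + chunkSize && j < n; j++)
        s += in[j];
    perThreadSum[t] = s;
}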
9
Parallel implementation – idea
(Reduction tree on in = [1, 9, 5, 1, 6, 4, 7, 2]:
 step 1 → 10, 6, 10, 9; step 2 → 16, 19; step 3 → 35)

• We need synchronization before each next step
• But: in a kernel function, we can only synchronize threads in the same block
• If n ≤ 2×block-size, we can use a kernel with one block
• If n > 2×block-size, what should we do?

Time: 3 = log2(n) = O(log2 n)
Work: 7 = n-1 = O(n) = work of the sequential version
(Later, we will see tasks in which parallel implementations need to do more
work than the sequential ones)

10
A simple reduction kernel

11
A simple reduction kernel

__global__ void reduceBlksKernel0(int* in, int* out, int n) {
    // Each thread is responsible for the element at index 2 * threadIdx.x
    int i = 2 * threadIdx.x;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0)
            in[i] += in[i + stride];
        __syncthreads(); // Wait for all adds of this step before the next step
    }
    // Thread 0 writes out the final sum of the block's 2 * blockDim.x elements
    if (threadIdx.x == 0)
        *out = in[0];
}
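A possible host-side launch for this one-block kernel (a sketch, not from the slides), assuming n ≤ 2×block-size and n a power of two so that every thread has a valid pair to add:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    const int n = 8;                        // n <= 2 * block-size, power of two
    int h_in[n] = {1, 9, 5, 1, 6, 4, 7, 2};
    int h_out = 0;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reduceBlksKernel0<<<1, n / 2>>>(d_in, d_out, n);   // one block, n/2 = 4 threads

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);                       // 35 for this input

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}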

12
Hierarchical reduction for bigger input

13
Parallel implementation
– idea to reduce within each block
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block

Stride = 1, active threadIdx.x = 0 1 2 3  →  10 9 6 1 10 4 9 2
Stride = 2, active threadIdx.x = 0 2      →  16 9 6 1 19 4 9 2
Stride = 4, active threadIdx.x = 0        →  35 9 6 1 19 4 9 2

14
Hierarchical reduction for arbitrary
input length
__global__ void reduceBlksKernel1(int* in, int* out, int n){
    // Each block reduces a segment of 2 * blockDim.x consecutive elements
    int i = blockIdx.x * 2 * blockDim.x + 2 * threadIdx.x;
    for (int stride = 1; stride <= blockDim.x; stride *= 2){
        if (threadIdx.x % stride == 0)
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }
    // Thread 0 adds this block's partial sum into the single global result
    if (threadIdx.x == 0)
        atomicAdd(out, in[blockIdx.x * 2 * blockDim.x]);
}

Assume: 2×block-size = 2^k (a power of two)
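A hedged host-side sketch of how this kernel might be launched (variable names like blockSize are illustrative, and d_in, d_out, h_out, n are assumed to be set up as in the earlier launch sketch); *out must be zeroed before the launch because every block atomically adds its partial sum into it:

int blockSize = 256;                                        // illustrative block size
int numBlocks = (n + 2 * blockSize - 1) / (2 * blockSize);  // each block covers 2*blockSize elements

cudaMemset(d_out, 0, sizeof(int));                          // atomicAdd accumulates into *d_out
reduceBlksKernel1<<<numBlocks, blockSize>>>(d_in, d_out, n);
cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);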
15
In each block, how many diverged warps?
(not considering blocks at the edge)

• Stride = 1:
  • All threads are “on”
  • → No diverged warps
• Stride = 2:
  • Only threads with threadIdx.x % 2 == 0 are “on”
  • → All warps are diverged
• Stride = 4, 8, …, 32:
  • All warps are diverged
• Stride = 64, 128, …:
  • The number of diverged warps decreases, down to 1
16
Kernel function – 2nd version:
reduce warp divergence
• Idea: reduce the number of diverged warps in each step by rearranging the
thread-to-data mapping so that the “on” threads are the first, adjacent threads
• Example: consider a block of 128 threads
• Stride = 1: All 128 threads are “on”
• Stride = 2: First 64 threads are “on”, the rest are “off”
• Stride = 4: First 32 threads are “on”, the rest are “off”
• …

17
Kernel function - 2nd version:
reduce warp divergence
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block

Stride = 1, active threadIdx.x = 0 1 2 3  →  10 9 6 1 10 4 9 2
Stride = 2, active threadIdx.x = 0 1      →  16 9 6 1 19 4 9 2
Stride = 4, active threadIdx.x = 0        →  35 9 6 1 19 4 9 2

18
Kernel function – 2nd version:
reduce warp divergence
__global__ void reduceOnDevice2(int *in, int *out, int n)
{
int numElemsBeforeBlk = blockIdx.x * blockDim.x * 2;

for (int stride = 1; stride < 2 * blockDim.x; stride *= 2)


{
int i = numElemsBeforeBlk + ...;
if (threadIdx.x ...)
if (i + stride < n)
in[i] += in[i + stride];
__syncthreads(); // Synchronize within each block
}

if (threadIdx.x == 0)
out[blockIdx.x] = in[numElemsBeforeBlk];
}

19
Kernel function – 2nd version:
reduce warp divergence
__global__ void reduceOnDevice2(int *in, int *out, int n)
{
    // Index of the first element of this block's segment of 2 * blockDim.x elements
    int numElemsBeforeBlk = blockIdx.x * blockDim.x * 2;

    for (int stride = 1; stride < 2 * blockDim.x; stride *= 2)
    {
        // Only the first blockDim.x / stride threads work in this step,
        // so the "on" threads stay packed into the lowest-numbered warps
        int i = numElemsBeforeBlk + threadIdx.x * 2 * stride;
        if (threadIdx.x < blockDim.x / stride)
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }

    // Thread 0 writes this block's partial sum
    if (threadIdx.x == 0)
        out[blockIdx.x] = in[numElemsBeforeBlk];
}
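Unlike the atomicAdd version, this kernel writes one partial sum per block into out[blockIdx.x], so out must hold numBlocks ints here. One possible way to finish the reduction (a sketch, not from the slides) is to sum the partial sums on the host:

// Sketch: combine the per-block partial sums produced by reduceOnDevice2.
// numBlocks and d_out are assumed to be set up as in the earlier sketches,
// with d_out now sized for numBlocks ints (requires <stdlib.h> for malloc/free).
int *h_partial = (int *)malloc(numBlocks * sizeof(int));
cudaMemcpy(h_partial, d_out, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

int total = 0;
for (int b = 0; b < numBlocks; b++)
    total += h_partial[b];   // final answer
free(h_partial);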

20
Kernel function - 3rd version:
reduce warp divergence + ?
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block

Stride = 4, active threadIdx.x = 0 1 2 3  →  7 13 12 3 6 4 7 2
Stride = 2, active threadIdx.x = 0 1      →  19 16 12 3 6 4 7 2
Stride = 1, active threadIdx.x = 0        →  35 16 12 3 6 4 7 2
21
Kernel function - 3rd version:
reduce warp divergence + ?
Code: your next homework ;-)

22
Reference
• [1] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming
Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
• [2] John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C
Programming. John Wiley & Sons, 2014.

23
THE END

24
