Sum Reduction Week10 Lec30

This document discusses parallel reduction in CUDA. It describes a tree-based approach used within each thread block to reduce portions of an array in parallel. It also addresses how to communicate partial results between thread blocks to process very large arrays. The solution is to decompose the computation into multiple kernel invocations in a recursive manner, with the code for each level being the same. The optimization goal is to achieve peak bandwidth, as reductions have very low arithmetic intensity. Algorithmic and code optimizations can provide speedups.


Parallel Reduction in CUDA

Week10_Lecture 30
Parallel Reduction

Common and important data parallel primitive

Easy to implement in CUDA

Serves as a great optimization example

Parallel Reduction

Tree-based approach used within each thread block


Pairwise sums at each level:

3 1 7 0 4 1 6 3
 4   7   5   9
   11     14
      25

Need to be able to use multiple thread blocks


To process very large arrays
To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array
But how do we communicate partial results between thread blocks?
Parallel Reduction

Values (shared memory):
10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1 (stride 1), thread IDs 0 2 4 6 8 10 12 14:
11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2 (stride 2), thread IDs 0 4 8 12:
18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3 (stride 4), thread IDs 0 8:
24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4 (stride 8), thread ID 0:
41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2
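The steps above follow an interleaved-addressing pattern: at stride s, threads whose index is a multiple of 2s add the element s positions away. A minimal kernel sketch of this pattern (the kernel name is illustrative, and the input length is assumed to be a multiple of the block size):

```cuda
// Per-block tree reduction with interleaved addressing, matching the
// stride-1/2/4/8 steps shown above. Each block reduces blockDim.x
// consecutive ints and writes its partial sum to g_out[blockIdx.x].
__global__ void reduce_interleaved(const int *g_in, int *g_out)
{
    extern __shared__ int sdata[];          // one int per thread

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = g_in[i];                   // load one element to shared memory
    __syncthreads();

    // Stride doubles each step: 1, 2, 4, 8, ...
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)             // threads 0, 2s, 4s, ... are active
            sdata[tid] += sdata[tid + s];
        __syncthreads();                    // make sums visible before next step
    }

    if (tid == 0)                           // thread 0 holds the block's sum
        g_out[blockIdx.x] = sdata[0];
}
```

Because the shared array is declared `extern`, its size is passed as the third launch parameter, e.g. `reduce_interleaved<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);`.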

Solution: Kernel Decomposition

Decompose the computation into multiple kernel invocations

Level 0: 8 blocks
Each block performs the tree-based reduction on its portion of the array
(e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25)

Level 1: 1 block
A single block reduces the 8 partial sums to the final result

In the case of reductions, the code for all levels is the same
Recursive kernel invocation
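The recursive invocation can be sketched on the host side, assuming an illustrative per-block kernel `reduce_block` that reduces `blockDim.x` consecutive elements into one partial sum per block, and that the element count stays a multiple of the block size at every level:

```cuda
// Host-side sketch of kernel decomposition: launch the same per-block
// kernel once per level, each launch shrinking the problem by a factor
// of the block size, until a single value remains.
void reduce(int *d_in, int *d_partial, int n)
{
    const int threads = 256;                       // illustrative block size
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;  // one partial sum per block
        reduce_block<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial);

        // The partial sums become the next level's input.
        int *tmp = d_in; d_in = d_partial; d_partial = tmp;
        n = blocks;
    }
    // After the loop, the final sum is at index 0 of the buffer
    // that was written by the last launch (d_in after the swap).
}
```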

What is Our Optimization Goal?

We should strive to reach GPU peak performance


Choose the right metric:
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth
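Since the metric is bandwidth, progress is measured by comparing achieved bandwidth against the GPU's peak. A minimal measurement sketch using CUDA events (the kernel launch is a placeholder for whichever reduction variant is under test):

```cuda
#include <cstdio>

// Sketch: time one kernel launch with CUDA events and report effective
// bandwidth. For a sum over n ints, roughly n * sizeof(int) bytes are
// read from global memory, so pass bytes = n * sizeof(int).
void report_bandwidth(size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // reduction_kernel<<<grid, block>>>(...);   // kernel under test goes here
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds

    // Effective bandwidth = bytes moved / elapsed seconds.
    printf("effective bandwidth: %.2f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Comparing this figure against the device's peak memory bandwidth shows how close a given reduction variant is to being bandwidth-optimal.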

Types of optimization

Algorithmic optimizations
Changes to addressing, algorithm cascading

Code optimizations
Loop unrolling (2.54x speedup, combined)
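A classic example of loop unrolling in a reduction is unrolling the final warp: once the stride drops to 32 or below, only one warp is still active, so on hardware where a warp executes in lockstep the `__syncthreads()` calls can be dropped and the last six steps written out explicitly. A sketch of that idiom (modern CUDA prefers `__syncwarp()` or warp shuffles; this follows the classic form):

```cuda
// Unrolled final warp of a shared-memory tree reduction, assuming the
// block has at least 64 threads. `volatile` prevents the compiler from
// caching shared-memory reads in registers between the unrolled steps.
__device__ void warp_reduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```

The main reduction loop then stops at stride 32 and calls `if (tid < 32) warp_reduce(sdata, tid);` instead of running its last six iterations.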

Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
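One way template parameters generate optimal code, sketched below: making the block size a compile-time constant lets the compiler resolve every size check at compile time and remove dead branches, effectively fully unrolling the reduction. Names and sizes are illustrative:

```cuda
// Reduction kernel with the block size as a template parameter.
// Each `if (blockSize >= ...)` is evaluated at compile time, so the
// generated code contains only the steps this block size needs.
template <unsigned int blockSize>
__global__ void reduce_templated(const int *g_in, int *g_out)
{
    __shared__ int sdata[blockSize];
    unsigned int tid = threadIdx.x;

    sdata[tid] = g_in[blockIdx.x * blockSize + tid];
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    if (tid < 32) {                      // final warp, classic lockstep form
        volatile int *v = sdata;
        if (blockSize >= 64) v[tid] += v[tid + 32];
        if (blockSize >= 32) v[tid] += v[tid + 16];
        if (blockSize >= 16) v[tid] += v[tid +  8];
        if (blockSize >=  8) v[tid] += v[tid +  4];
        if (blockSize >=  4) v[tid] += v[tid +  2];
        if (blockSize >=  2) v[tid] += v[tid +  1];
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// The specialization is chosen at compile time at the call site, e.g.:
// reduce_templated<256><<<blocks, 256>>>(d_in, d_out);
```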
