Sum Reduction Week10 Lec30

This document discusses parallel reduction in CUDA. It describes a tree-based approach used within each thread block to reduce portions of an array in parallel. It also addresses how to communicate partial results between thread blocks to process very large arrays. The solution is to decompose the computation into multiple kernel invocations in a recursive manner, with the code for each level being the same. The optimization goal is to achieve peak bandwidth, as reductions have very low arithmetic intensity. Algorithmic and code optimizations can provide speedups.


Parallel Reduction in CUDA

Week10_Lecture 30
Parallel Reduction

Common and important data parallel primitive

Easy to implement in CUDA

Serves as a great optimization example

Parallel Reduction

Tree-based approach used within each thread block


Pairwise sums at each level:

3 1 7 0 4 1 6 3
 4   7   5   9
   11     14
      25

Need to be able to use multiple thread blocks


To process very large arrays
To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array
But how do we communicate partial results between thread blocks?
Parallel Reduction

Values (shared memory):
10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1 (stride 1), thread IDs 0 2 4 6 8 10 12 14:
11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2 (stride 2), thread IDs 0 4 8 12:
18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3 (stride 4), thread IDs 0 8:
24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4 (stride 8), thread ID 0:
41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2
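The steps above follow an interleaved-addressing pattern: at stride s, threads whose index is a multiple of 2s add the element s positions away. A minimal kernel sketch of this pattern (the kernel name is illustrative, and the input length is assumed to be a multiple of the block size):

```cuda
// Per-block tree reduction with interleaved addressing, matching the
// stride-1/2/4/8 steps shown above. Each block reduces blockDim.x
// consecutive ints and writes its partial sum to g_out[blockIdx.x].
__global__ void reduce_interleaved(const int *g_in, int *g_out)
{
    extern __shared__ int sdata[];          // one int per thread

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = g_in[i];                   // load one element to shared memory
    __syncthreads();

    // Stride doubles each step: 1, 2, 4, 8, ...
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)             // threads 0, 2s, 4s, ... are active
            sdata[tid] += sdata[tid + s];
        __syncthreads();                    // make sums visible before next step
    }

    if (tid == 0)                           // thread 0 holds the block's sum
        g_out[blockIdx.x] = sdata[0];
}
```

Because the shared array is declared `extern`, its size is passed as the third launch parameter, e.g. `reduce_interleaved<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);`.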

Solution: Kernel Decomposition

Decompose the computation into multiple kernel invocations

Level 0: 8 blocks
Each block performs the tree-based reduction on its portion of the array
(e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25)

Level 1: 1 block
A single block reduces the 8 partial sums to the final result

In the case of reductions, the code for all levels is the same
Recursive kernel invocation
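The recursive invocation can be sketched on the host side, assuming an illustrative per-block kernel `reduce_block` that reduces `blockDim.x` consecutive elements into one partial sum per block, and that the element count stays a multiple of the block size at every level:

```cuda
// Host-side sketch of kernel decomposition: launch the same per-block
// kernel once per level, each launch shrinking the problem by a factor
// of the block size, until a single value remains.
void reduce(int *d_in, int *d_partial, int n)
{
    const int threads = 256;                       // illustrative block size
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;  // one partial sum per block
        reduce_block<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial);

        // The partial sums become the next level's input.
        int *tmp = d_in; d_in = d_partial; d_partial = tmp;
        n = blocks;
    }
    // After the loop, the final sum is at index 0 of the buffer
    // that was written by the last launch (d_in after the swap).
}
```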

What is Our Optimization Goal?

We should strive to reach GPU peak performance


Choose the right metric:
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth
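Since the metric is bandwidth, progress is measured by comparing achieved bandwidth against the GPU's peak. A minimal measurement sketch using CUDA events (the kernel launch is a placeholder for whichever reduction variant is under test):

```cuda
#include <cstdio>

// Sketch: time one kernel launch with CUDA events and report effective
// bandwidth. For a sum over n ints, roughly n * sizeof(int) bytes are
// read from global memory, so pass bytes = n * sizeof(int).
void report_bandwidth(size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // reduction_kernel<<<grid, block>>>(...);   // kernel under test goes here
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds

    // Effective bandwidth = bytes moved / elapsed seconds.
    printf("effective bandwidth: %.2f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Comparing this figure against the device's peak memory bandwidth shows how close a given reduction variant is to being bandwidth-optimal.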

Types of optimization

Algorithmic optimizations
Changes to addressing, algorithm cascading

Code optimizations
Loop unrolling (2.54x speedup, combined)
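A classic example of loop unrolling in a reduction is unrolling the final warp: once the stride drops to 32 or below, only one warp is still active, so on hardware where a warp executes in lockstep the `__syncthreads()` calls can be dropped and the last six steps written out explicitly. A sketch of that idiom (modern CUDA prefers `__syncwarp()` or warp shuffles; this follows the classic form):

```cuda
// Unrolled final warp of a shared-memory tree reduction, assuming the
// block has at least 64 threads. `volatile` prevents the compiler from
// caching shared-memory reads in registers between the unrolled steps.
__device__ void warp_reduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```

The main reduction loop then stops at stride 32 and calls `if (tid < 32) warp_reduce(sdata, tid);` instead of running its last six iterations.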

Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
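One way template parameters generate optimal code, sketched below: making the block size a compile-time constant lets the compiler resolve every size check at compile time and remove dead branches, effectively fully unrolling the reduction. Names and sizes are illustrative:

```cuda
// Reduction kernel with the block size as a template parameter.
// Each `if (blockSize >= ...)` is evaluated at compile time, so the
// generated code contains only the steps this block size needs.
template <unsigned int blockSize>
__global__ void reduce_templated(const int *g_in, int *g_out)
{
    __shared__ int sdata[blockSize];
    unsigned int tid = threadIdx.x;

    sdata[tid] = g_in[blockIdx.x * blockSize + tid];
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    if (tid < 32) {                      // final warp, classic lockstep form
        volatile int *v = sdata;
        if (blockSize >= 64) v[tid] += v[tid + 32];
        if (blockSize >= 32) v[tid] += v[tid + 16];
        if (blockSize >= 16) v[tid] += v[tid +  8];
        if (blockSize >=  8) v[tid] += v[tid +  4];
        if (blockSize >=  4) v[tid] += v[tid +  2];
        if (blockSize >=  2) v[tid] += v[tid +  1];
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// The specialization is chosen at compile time at the call site, e.g.:
// reduce_templated<256><<<blocks, 256>>>(d_in, d_out);
```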
