Ece408 Lecture13 Reduction Tree VK FL24
Lecture 13
Parallel Computation Patterns –
Reduction Trees
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• Project Milestone 1
– Baseline CPU/GPU implementation is due this Friday
• Lab 5
– Will be released later this week
• Lectures
– No lecture next Tuesday Oct. 15th due to the midterm exam
– Lecture for Thursday Oct. 17th will be recorded and posted on-line for you to
watch on your own time
• Lab grading
– Please check all your lab (1-4) grades and make sure they are correct and correctly
recorded.
Midterm 1
• Midterm 1 on October 15th at 7pm in ECEB, see Canvas for details
– Handwritten notes are NOT permitted; we will provide a reference sheet, which
will also be posted on Canvas prior to the exam
– Exam will include multiple-choice, written-response, fill-in-the-blank, and coding questions
– Covers Lectures 1-12, the "Application Profiling with Nsight" lecture, Labs 1-4, and PM1
– Paper-based; see examples from prior semesters, particularly SP24
– HKN review session: 3-5:30pm on Sunday in 1002 ECEB
A note about tools
• Kernel Profiling
– Old name: nv-nsight-cu-cli (mentioned in the lecture)
– New name: ncu (used in the project)
– Both commands are simply aliases and function identically, with the same usage
and behavior.
• Memory Checking
– Old name: cuda-memcheck (mentioned in both the lecture and the project)
– New name: compute-sanitizer
– cuda-memcheck and compute-sanitizer are distinct tools. The newer
compute-sanitizer integrates additional debugging features beyond memory checking.
However, when it comes to detecting memory errors, compute-sanitizer can serve
as a drop-in replacement for cuda-memcheck.
Objectives
• To learn the basic concept of reductions, one of the most
widely used parallel computation patterns
• To learn simple strategies for parallelization of reductions
• To understand the performance issues involved with
performing reductions on GPUs
Important Enough to Use in Theory
“… scan operations, also known as prefix computations,
can execute in no more time than …
parallel memory references …
greatly simplify the description of many [parallel]
algorithms, and are
significantly easier to implement than
memory references.” —Guy Blelloch, 1989*
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Trying to Bridge Theory and Practice
A generic parallel algorithm,
• in which parallel threads access memory arbitrarily,
• is likely to produce an extremely slow access pattern.
Scans
• can be implemented quickly in hardware, and
• form a useful alternative to arbitrary memory accesses.
(His hope was to enable theory
without knowledge of microarchitecture.)
Example Use: Summarizing Results
1. Start with a large set of things (examples: integers, social
networking user information)
2. Process each thing independently to produce some value
(examples: number of friends, timeline posts in last two weeks)
3. Summarize!
– Typically, with an associative
and commutative operation (+, *, min, max, …)
– since things in the set are
unordered and independent.
Focus on Reduction Using a Tree
Pattern is so common that
• people have built frameworks around it!
• examples: Google and Hadoop MapReduce
Let’s focus on the summarization, called a reduction:
– no required order for processing the values (operator is associative
and commutative), so
– partition the data set into smaller chunks,
– have each thread process a chunk, and
– use a tree to compute the final answer.
Reduction Enables Parallelization
Reduction enables common parallel transformations.
What Exactly is a Reduction?
Reduce a set of inputs to a single value
• using a binary operator, such as
– sum, product, minimum, maximum,
• or a user-defined reduction operation
– must be associative and commutative
– and have an identity value (example: 0 for sum)
Sequential Reduction is O(N)
Given a binary operator ⊕ and an identity value I⊕
• I⊕ = 0 for sum
• I⊕ = 1 for product
• I⊕ = largest possible value for min
• I⊕ = smallest possible value for max

result ← I⊕
for each value X in input
    result ← result ⊕ X
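For concreteness, a minimal C sketch of this sequential loop for a sum reduction (the function and variable names are illustrative, not from the slides):

float sequential_sum(const float *input, unsigned int N)
{
    float result = 0.0f;              // identity value for sum
    for (unsigned int i = 0; i < N; ++i)
        result = result + input[i];   // result <- result (+) input[i]
    return result;
}

One pass over the N inputs: O(N) work and no parallelism.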
Example: Parallel Max Reduction in log(N) Steps
Input:   3   1   7   0   4   1   6   3
Step 1:    3       7       4       6      (pairwise max)
Step 2:        7               6
Step 3:                7
Tournaments Use Reduction with “max”
Algorithm is Work Efficient
For N input values, the number of operations is
(1/2)N + (1/4)N + (1/8)N + ⋯ + (1/N)N = (1 − 1/N)N = N − 1.
Fast if Enough Resources are Available
For N input values, the number of steps is log(N).
With enough execution resources,
• N=1,000,000 takes 20 steps!
• Sounds great!
How much parallelism do we need?
• On average, (N-1)/log(N).
50,000 in our example.
• But peak is N/2!
500,000 in our example.
Diminishing Parallelism is Common
In our parallel reduction,
• the number of operations
• halves in every step.
This kind of narrowing parallelism is common
• from combinational logic circuits
• to basic blocks
• to high-performance applications.
Parallel Strategy for CUDA
Let’s start simple: N values in device global memory.
Each thread block of M threads
• uses shared memory,
• to reduce a chunk of 2M values to one value
• (2M << N to produce enough thread blocks).
Blocks operate within shared memory
• to reduce global memory traffic, and
• write one value back to global memory.
CUDA Reduction Algorithm
1. Read block of 2M values into shared memory.
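A minimal sketch of this load step, assuming the kernel receives the input array in global memory and uses BLOCK_SIZE threads per block (these names are assumptions, not code from the slides):

__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t     = threadIdx.x;
unsigned int start = 2 * blockIdx.x * blockDim.x;   // first element of this block's 2M-value chunk

// Thread t loads its two adjacent elements, 2t and 2t+1 (the mapping on the next slide).
partialSum[2*t]     = input[start + 2*t];
partialSum[2*t + 1] = input[start + 2*t + 1];

The __syncthreads() at the start of each reduction step (shown a few slides later) guarantees these loads complete before any thread starts adding.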
A Simple Mapping of Data to Threads
Each thread
• begins with two adjacent locations (stride of 1),
• an even index (first) and an odd index (second).
• Thread 0 gets 0 and 1, Thread 1 gets 2 and 3, …
• Write result back to the even index.
After each step,
• half of the active threads are done.
• Double the stride.
At the end, result is at index 0.
Naïve Data Mapping for a Reduction
[Figure: naïve mapping. Thread 0 starts with inputs 0 and 1, Thread 1 with inputs 2 and 3, and so on; after step 3, index 0 holds the partial result for elements 0..7 and index 8 holds the partial result for elements 8..15.]
A Sum Example (Values Instead of Indices)
Input:   3   1   7   0   4   1   6   3
Step 1:  4       7       5       9      (active partial-sum elements)
Step 2: 11              14
Step 3: 25
The Reduction Steps
// Stride is the distance to the next value being
// accumulated into this thread's mapped position
// in the partialSum[] array.
for (unsigned int stride = 1;
     stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t + stride];
}
Why do we need __syncthreads()?
Barrier Synchronization
__syncthreads() ensures
• all elements of the partial sum are generated
• before the next step uses them.
Example Without __syncthreads
Input:   3   1   7   0   4   1   6   3
Step 1:  4       7       5      (9 not done in time)
Step 2: 11              11      (stale value read instead of 14)
Step 3: 22                      (incorrect partial sum; should be 25)
Several Options after Blocks are Done
After all reduction steps, thread 0
• writes block’s sum from partialSum[0]
• into global vector indexed by blockIdx.x.
Vector has length N/(2M).
• If small, transfer vector to host
and sum it up on CPU.
• If large, launch kernel again (and again).
(Kernel can also accumulate to a global sum using atomic
operations, to be covered soon.)
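A sketch of the writeback plus the "sum it up on CPU" option; output, numBlocks, and the host-side names are assumptions for illustration, not code from the slides:

// End of the kernel: thread 0 writes the block's sum to global memory.
if (t == 0)
    output[blockIdx.x] = partialSum[0];

// Host side, when the per-block vector is small:
cudaMemcpy(h_partial, d_output, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
float total = 0.0f;
for (unsigned int b = 0; b < numBlocks; ++b)
    total += h_partial[b];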
“Segmented Reduction”
[Figure: a list of arbitrary length in global memory is split into segments; Block 0 through Block n-1 each reduce one segment into their own partialSum.]
Analysis of Execution Resources
All threads active in the first step.
In all subsequent steps, two control flow paths:
• perform addition, or do nothing.
• Doing nothing still consumes execution resources.
At most half of the threads perform an addition after the first step
• (all threads with odd indices are disabled after the first step).
• After the fifth step, entire warps do nothing:
poor resource utilization, though an idle warp has no divergence.
• Active warps have only one active thread.
Up to five more steps (if limited to 1024 threads).
Improve Performance by Reassigning Data
Can we do better?
Absolutely!
A Better Strategy
Illustration with 16 Threads
[Figure: with 16 threads and 32 inputs, step 1 pairs element t with element t+16 (Thread 0 adds elements 0 and 16, ..., Thread 15 adds elements 15 and 31); the stride halves in each later step.]
A Better Reduction Kernel
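The kernel on this slide was shown as an image; a sketch of the stride-halving loop it describes, consistent with the strategy on the previous slides, might look like:

// partialSum[] again holds the block's 2*blockDim.x values; t = threadIdx.x as before.
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];   // consecutive threads stay active
}
// The block's result ends up in partialSum[0].

Because the active threads always form a consecutive block starting at 0, whole warps retire together, which is exactly the behavior analyzed on the next slide.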
Again: Analysis of Execution Resources
Given 1024 threads,
• Block loads 2048 elements to shared memory.
• No branch divergence in the first six steps:
– 1024, 512, 256, 128, 64, and 32
consecutive threads active;
– threads in each warp either
all active or all inactive
• Last six steps have one active warp
(branch divergence for last five steps).
Parallel Algorithm Overhead
__shared__ float partialSum[2*BLOCK_SIZE];
Parallel Execution Overhead
[Figure: the eight-element sum-reduction tree, annotated with the add operations at each level.]
Although the number of "operations" is N, each "operation"
involves much more complex address calculation and
intermediate result manipulation.
Further Improvements
Can we further improve reduction?
Make Use of Shared Memory
How much shared memory are we using?
Each block of 1,024 threads reads 2,048 values.
• Let’s say two blocks per SM,
• so 16 kB ( = 2,048 × 2 × 4B ).
Could read 4,096 or 8,192 values
• (with 64 kB per SM)
• to slightly increase parallelism.
(For 48 kB per SM, use 6,144 values and have all threads do a 3-to-1
reduction before the current loop.)
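A rough sketch of that 3-to-1 pre-reduction, assuming partialSum[] is enlarged to hold the block's 6,144 values (indices and names are illustrative, not from the slides):

__shared__ float partialSum[6*BLOCK_SIZE];          // 24 kB per block for BLOCK_SIZE = 1024

unsigned int t     = threadIdx.x;
unsigned int start = 6 * blockIdx.x * blockDim.x;
for (unsigned int k = 0; k < 6; ++k)                // each thread loads six elements
    partialSum[k*blockDim.x + t] = input[start + k*blockDim.x + t];

// 3-to-1: fold the upper 4*blockDim.x values into the lower 2*blockDim.x.
// Each thread folds only values it loaded itself, so no extra barrier is needed here.
partialSum[t]              += partialSum[2*blockDim.x + t] + partialSum[4*blockDim.x + t];
partialSum[blockDim.x + t] += partialSum[3*blockDim.x + t] + partialSum[5*blockDim.x + t];
// ...then run the existing reduction loop over the first 2*blockDim.x entries.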
Eliminate the Narrow Parallelism
What about parallelism?
Get Rid of the Overhead
Launching kernels is expensive.
• Why bother tearing down and setting up the same blocks on the same
SMs?
• Makes no sense.
• Remember that reduction operators are associative and commutative.
Let’s be compute-centric:
• put 2048 threads (as two blocks) on each SM, and
• just keep them there until we’re done!
Work Until the Data is Exhausted!
Say there are 8 SMs, so 16 blocks.
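One way to realize these persistent blocks is a grid-stride loop: each thread accumulates a private sum over its share of the whole input, followed by a single in-block tree reduction and one atomic add per block. The kernel below is a sketch under those assumptions, not code from the slides:

#define BLOCK_SIZE 1024

__global__ void persistentReduce(const float *input, float *total, unsigned int N)
{
    __shared__ float partialSum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;

    // Phase 1: stride through the entire input with a fixed number of blocks.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + t; i < N;
         i += gridDim.x * blockDim.x)
        sum += input[i];
    partialSum[t] = sum;

    // Phase 2: one tree reduction per block, exactly as before.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Phase 3: one value per block, accumulated with the atomics mentioned earlier.
    if (t == 0)
        atomicAdd(total, partialSum[0]);
}

// Launched once with a fixed grid, e.g. 16 blocks for the 8-SM example:
//   persistentReduce<<<16, BLOCK_SIZE>>>(d_input, d_total, N);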
Caveat
I didn’t try these ideas.
ANY MORE QUESTIONS?
READ CHAPTER 5
Problem Solving
• Q: Consider a 3D video filtering (convolution) code in CUDA with
a 5x5x7 mask, which is stored in constant memory. Shared
memory is used to fully store the input tile required for a
16x16x16 output tile (for example, using strategy 2). What is the
ratio of total global memory read operations to shared memory
accesses for one output tile? For this question, only consider
interior tiles with no ghost elements.
• A:
– The input tile for a 16x16x16 output tile is (16+5-1) x (16+5-1) x (16+7-1) = 20x20x22;
each of its elements is read from global memory once, giving 20*20*22 = 8,800 reads.
– Each of the 16*16*16 = 4,096 output elements reads 5*5*7 = 175 values from shared
memory, giving 5*5*7*16*16*16 = 716,800 shared memory accesses.
– Ratio: 20*20*22 to 5*5*7*16*16*16, i.e., 8,800 to 716,800 (about 1 to 81.5).