Ece408 Lecture13 Reduction Tree VK FL24

The document outlines the lecture on Parallel Computation Patterns, specifically focusing on Reduction Trees in the Applied Parallel Programming course. It includes course reminders, midterm details, and objectives related to understanding reductions and their parallelization strategies. Additionally, it discusses the performance implications and algorithms for implementing reductions efficiently on GPUs.


ECE408/CS483/CSE408 Fall 2024

Applied Parallel Programming

Lecture 13
Parallel Computation Patterns –
Reduction Trees

1
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• Project Milestone 1
– Baseline CPU/GPU implementation is due this Friday
• Lab 5
– Will be released later this week
• Lectures
– No lecture next Tuesday Oct. 15th due to the midterm exam
– Lecture for Thursday Oct. 17th will be recorded and posted on-line for you to
watch on your own time
• Lab grading
– Please check all your lab (1-4) grades and make sure they are correct and correctly
recorded.
2
Midterm 1
• Midterm 1 on October 15th at 7pm in ECEB, see Canvas for details
– Handwritten notes are NOT permitted; we will provide a reference sheet, which
will also be posted on Canvas prior to the exam
– Exam will include multiple-choice, written-response, fill-in-the-blank, and
coding questions
– Coverage: Lectures 1-12, the "Application Profiling with Nsight" lecture, Labs 1-4, and PM1
– Paper-based, see examples from prior semesters, particularly from SP24
– HKN review session: 3-5:30pm on Sunday in 1002 ECEB

3
A note about tools
• Kernel Profiling
– Old name: nv-nsight-cu-cli (mentioned in the lecture)
– New name: ncu (used in the project)
– Both commands are simply aliases and function identically, with the same usage
and behavior.
• Memory Checking
– Old name: cuda-memcheck (mentioned in both the lecture and the project)
– New name: compute-sanitizer
– cuda-memcheck and compute-sanitizer are distinct tools. The newer compute-
sanitizer integrates additional debugging features beyond memory checking.
However, when it comes to detecting memory errors, compute-sanitizer can serve
as a drop-in replacement for cuda-memcheck.
4
Objectives
• To learn the basic concept of reductions, one of the most
widely used parallel computation patterns
• To learn simple strategies for parallelization of reductions
• To understand the performance issues involved with
performing reductions on GPUs

5
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Important Enough to Use in Theory
“… scan operations, also known as prefix computations,
can execute in no more time than …
parallel memory references …
greatly simplify the description of many [parallel]
algorithms, and are
significantly easier to implement than
memory references.” —Guy Blelloch, 1989*

*G. Blelloch, “Scans as Primitive Parallel Operations,”


IEEE Transactions on Computers, 38(11):1526-1538, 1989.
The idea behind scans for computation goes back another 30+ years.

6
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Trying to Bridge Theory and Practice
A generic parallel algorithm,
• in which parallel threads access memory arbitrarily,
• is likely to produce an extremely slow access pattern.
Scans
• can be implemented quickly in hardware, and
• form a useful alternative to arbitrary memory accesses.
(His hope was to enable theory
without knowledge of microarchitecture.)

7
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Example Use: Summarizing Results
1. Start with a large set of things (examples: integers, social
networking user information)
2. Process each thing independently to produce some value
(examples: number of friends, timeline posts in last two weeks)
3. Summarize!
– Typically, with an associative
and commutative operation (+, *, min, max, …)
– since things in the set are
unordered and independent.

8
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Focus on Reduction Using a Tree
Pattern is so common that
• people have built frameworks around it!
• examples: Google and Hadoop MapReduce
Let’s focus on the summarization, called a reduction:
– no required order for processing the values (operator is associative
and commutative), so
– partition the data set into smaller chunks,
– have each thread process a chunk, and
– use a tree to compute the final answer.

9
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Reduction Enables Parallelization
Reduction enables common parallel transformations.

example: privatization of output


• Loop iterations sum into a single output
(examples: inner loops in matrix multiply and convolution).
• To parallelize iterations, must
make private copies of the output!
• Use reduction to sum private copies
into the original output.

10
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
What Exactly is a Reduction?
Reduce a set of inputs to a single value
• using a binary operator, such as
– sum, product, minimum, maximum,
• or a user-defined reduction operation
– must be associative and commutative
– and have an identity value (example: 0 for sum)

Available in most parallel libraries as collective operations (like barriers).
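
For instance (a hedged illustration, not part of the slides; N is an assumed variable),
CUDA's Thrust library exposes reduction as a single library collective:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Sum-reduce N device values with a library collective call.
thrust::device_vector<float> d_vals(N, 1.0f);
float total = thrust::reduce(d_vals.begin(), d_vals.end(),
                             0.0f,                   // identity value for +
                             thrust::plus<float>()); // associative, commutative operator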

11
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Sequential Reduction is O(N)
Given a binary operator ⊕ and an identity value I⊕:
• I⊕ = 0 for sum
• I⊕ = 1 for product
• I⊕ = largest possible value for min
• I⊕ = smallest possible value for max

result ← I⊕
for each value X in input
    result ← result ⊕ X
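
As a concrete illustration (a minimal C sketch, not from the slides; the function
name is illustrative), the sequential sum reduction is simply:

// Sequential sum reduction: start from the identity value,
// apply the operator once per input element.
float sequential_sum(const float *input, unsigned int n) {
    float result = 0.0f;               // result <- identity for +
    for (unsigned int i = 0; i < n; ++i)
        result = result + input[i];    // result <- result (+) X
    return result;
}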

12
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Example: Parallel Max Reduction in log(N) Steps
[Figure: max reduction tree on the input 3 1 7 0 4 1 6 3.
 Step 1: pairwise max gives 3 7 4 6.
 Step 2: 7 6.
 Step 3: 7.]

13
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Tournaments Use Reduction with “max”

(A more artful rendition of the reduction tree.)

14
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Algorithm is Work Efficient
For N input values, the number of operations is
    N/2 + N/4 + N/8 + ⋯ + 1 = (1 − 1/N)·N = N − 1.

The parallel algorithm shown is work-efficient:


• requires the same amount of work
as a sequential algorithm
• (constant overheads, but nothing dependent on N).

15
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Fast if Enough Resources are Available
For N input values, the number of steps is log (N).
With enough execution resources,
• N=1,000,000 takes 20 steps!
• Sounds great!
How much parallelism do we need?
• On average, (N-1)/log(N).
50,000 in our example.
• But peak is N/2!
500,000 in our example.

16
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Diminishing Parallelism is Common
In our parallel reduction,
• the number of operations
• halves in every step.
This kind of narrowing parallelism is common
• from combinational logic circuits
• to basic blocks
• to high-performance applications.

CUDA kernels allow only a fixed number of threads.

17
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Parallel Strategy for CUDA
Let’s start simple: N values in device global memory.
Each thread block of M threads
• uses shared memory,
• to reduce a chunk of 2M values to one value
• (2M << N to produce enough thread blocks).
Blocks operate within shared memory
• to reduce global memory traffic, and
• write one value back to global memory.

18
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
CUDA Reduction Algorithm
1. Read block of 2M values into shared memory.

2. For each of log(2M) steps,


– combine two values per thread in each step,
– write result to shared memory, and
– halve the number of active threads.

3. Write final result back to global memory.

19
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
A Simple Mapping of Data to Threads
Each thread
• begins with two adjacent locations (stride of 1),
• an even index (first) and an odd index (second).
• Thread 0 gets 0 and 1, Thread 1 gets 2 and 3, …
• Write result back to the even index.
After each step,
• half of active threads are done.
• Double the stride.
At the end, result is at index 0.

20
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Naïve Data Mapping for a Reduction
[Figure: naïve data mapping. Threads 0-5 cover input indices 0-11.
 Step 1: each thread sums an adjacent pair (0..1, 2..3, 4..5, 6..7, 8..9, 10..11).
 Step 2: surviving threads sum ranges 0..3, 4..7, 8..11.
 Step 3: ranges 0..7 and 8..15.]

21
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
A Sum Example (Values Instead of Indices)
[Figure: sum reduction of 3 1 7 0 4 1 6 3 with threads 0-3.
 Step 1: active partial sums 4, 7, 5, 9.
 Step 2: 11, 14.
 Step 3: 25.]

22
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
The Reduction Steps
// Stride is the distance to the next value being
// accumulated into the thread's mapped position
// in the partialSum[] array.
for (unsigned int stride = 1;
     stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t+stride];
}
Why do we need __syncthreads()?
23
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Barrier Synchronization

__syncthreads() ensures
• all elements of partial sum generated
• before the next step uses them.

Why do we not need __syncthreads() at the end of the reduction loop?

24
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Example Without __syncthreads()
[Figure: the same sum example, but thread 3's step-1 partial sum (9) is not
 done in time. Step 2 then produces the partial sums 11 and 11 instead of
 11 and 14, and step 3 yields the incorrect total 22 instead of 25.]
25
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Several Options after Blocks are Done
After all reduction steps, thread 0
• writes block’s sum from partialSum[0]
• into global vector indexed by blockIdx.x.
Vector has length N / 2M.
• If small, transfer vector to host
and sum it up on CPU.
• If large, launch kernel again (and again).
(Kernel can also accumulate to a global sum using atomic
operations, to be covered soon.)
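
For the small-vector case, the host-side finish might look like this (a hedged
sketch; d_partialSums and numBlocks are illustrative names, and error checking
is omitted):

// Copy the per-block sums back and reduce the short vector on the CPU.
float *h_partialSums = (float *)malloc(numBlocks * sizeof(float));
cudaMemcpy(h_partialSums, d_partialSums, numBlocks * sizeof(float),
           cudaMemcpyDeviceToHost);
float total = 0.0f;
for (unsigned int i = 0; i < numBlocks; ++i)
    total += h_partialSums[i];
free(h_partialSums);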

26
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
“Segmented Reduction”
[Figure: a list of arbitrary length in global memory is partitioned across
 thread blocks. Each block (Block 0, Block 1, Block 2, …, Block n-1) loads
 its chunk into its shared-memory partialSum array, reduces it, and writes
 a single partial sum back to a per-block slot of a global-memory vector
 for the whole grid.]

Copy that vector back to the host and let the host finish the work.

27
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Analysis of Execution Resources
All threads active in the first step.
In all subsequent steps, two control flow paths:
• perform addition, or do nothing.
• Doing nothing still consumes execution resources.
At most half of threads perform addition after first step
• (all threads with odd indices disabled after first step).
• After fifth step, entire warps do nothing:
poor resource utilization, but no divergence.
• Active warps have only one active thread.
Up to five more steps (if limited to 1024 threads).

28
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Improve Performance by Reassigning Data

Can we do better?

Absolutely!

How we assign data to threads makes a difference in some algorithms,
including reduction.

29
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
A Better Strategy

Let’s try this approach: in each step,
• compact the partial sums
• into the first locations of the partialSum array.
Doing so keeps the active threads consecutive.

30
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Illustration with 16 Threads
[Figure: 16 threads reducing 32 elements (indices 0-31). In step 1, thread 0
 adds elements 0 and 16, thread 1 adds 1 and 17, …, thread 15 adds 15 and 31,
 compacting the partial sums into locations 0-15. Subsequent steps halve the
 stride in the same way.]
31
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
A Better Reduction Kernel

for (unsigned int stride = blockDim.x;
     stride >= 1; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}

32
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Again: Analysis of Execution Resources
Given 1024 threads,
• Block loads 2048 elements to shared memory.
• No branch divergence in the first six steps:
– 1024, 512, 256, 128, 64, and 32
consecutive threads active;
– threads in each warp either
all active or all inactive
• Last six steps have one active warp
(branch divergence for last five steps).

33
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Parallel Algorithm Overhead
__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
// Each thread loads two elements into shared memory.
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
for (unsigned int stride = blockDim.x;
     stride >= 1; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
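
For context, a host-side launch for such a kernel might be set up as follows
(a hedged sketch; the kernel name reductionKernel and the device pointers are
illustrative assumptions, and the kernel above shows only the body):

// Each block of BLOCK_SIZE threads consumes 2*BLOCK_SIZE input elements.
unsigned int numBlocks = (N + 2*BLOCK_SIZE - 1) / (2*BLOCK_SIZE);
reductionKernel<<<numBlocks, BLOCK_SIZE>>>(d_input, d_partialSums, N);
// d_partialSums now holds numBlocks values still to be reduced
// (on the host, or by launching the kernel again).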

34
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Parallel Execution Overhead
[Figure: the pairwise reduction tree over 3 1 7 0 4 1 6 3, with a "+" at each
 internal node.]

Although the number of "operations" is N, each "operation" involves much more
complex address calculation and intermediate result manipulation.

If the parallel code is executed on single-threaded hardware, it would be
significantly slower than the code based on the original sequential algorithm.
36
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Further Improvements
Can we further improve reduction?

The problem is memory-bound:


• one operation for every 4B value read;
• so focus on memory coalescing and
avoiding poor computational resource use.

37
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Make Use of Shared Memory
How much shared memory are we using?
Each block of 1,024 threads reads 2,048 values.
• Let’s say two blocks per SM,
• so 16 kB ( = 2,048 × 2 × 4B ).
Could read 4,096 or 8,192 values
• (with 64 kB per SM)
• to slightly increase parallelism.
(For 48 kB per SM, use 6,144 values and have all threads do a 3-to-1
reduction before the current loop.)
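
A minimal sketch of that idea (an illustration only, assuming BLOCK_SIZE = 1024
and the same input/partialSum naming as the earlier kernel):

__shared__ float partialSum[6*BLOCK_SIZE];   // 24 kB per block, 48 kB for two blocks

unsigned int t = threadIdx.x;
unsigned int start = 6*blockIdx.x*blockDim.x;
// Each thread loads six values, one per segment.
for (unsigned int i = 0; i < 6; ++i)
    partialSum[i*blockDim.x + t] = input[start + i*blockDim.x + t];
// 3-to-1 pre-reduction: fold segments 2-5 into the first two segments.
// (Each thread only combines values it loaded itself, so no barrier is needed here.)
partialSum[t]              += partialSum[2*blockDim.x + t] + partialSum[4*blockDim.x + t];
partialSum[blockDim.x + t] += partialSum[3*blockDim.x + t] + partialSum[5*blockDim.x + t];
// ...then run the stride loop from the earlier slide, unchanged.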

38
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Eliminate the Narrow Parallelism
What about parallelism?

Smaller blocks might seem attractive:


• when one warp is active,
• each SM has one warp per block.

But there are probably better ways. For example,


• stop reducing at 32 elements (or at 64, or 128), and
• hand off to the next kernel.
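
One hedged way to express that early stop (illustrative only; it modifies the
"better" kernel above and assumes an output array with 32 slots per block):

// Stop the tree when 32 partial sums remain; hand them to the next
// kernel instead of finishing with mostly-idle, divergent warps.
for (unsigned int stride = blockDim.x; stride >= 32; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
if (t < 32)
    output[blockIdx.x*32 + t] = partialSum[t];   // each thread writes the slot it owns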

39
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Get Rid of the Overhead
Launching kernels is expensive.
• Why bother tearing down and setting up the same blocks on the same
SMs?
• Makes no sense.
• Remember that reduction operators are associative and commutative.
Let’s be compute-centric:
• put 2048 threads (as two blocks) on each SM, and
• just keep them there until we’re done!

40
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Work Until the Data is Exhausted!
Say there are 8 SMs, so 16 blocks.

1. Divide the whole dataset into 16 chunks.


2. Read enough to fill shared memory.
3. Compute … but only until some threads are no longer needed.
4. Then load more data!
5. Repeat until the data are exhausted,
6. THEN let parallelism drop.
(Gather 16 values on host and reduce them.)
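
One possible realization of this idea (a sketch under assumptions, not the
lecture's exact scheme; input, output, N, and BLOCK_SIZE are illustrative
names): each thread keeps a private running sum while its block sweeps its
entire share of the data, so parallelism stays full until the data is
exhausted, and the tree reduction runs only once at the end.

__shared__ float partialSum[BLOCK_SIZE];

unsigned int t = threadIdx.x;
float sum = 0.0f;
// Grid-stride loop: the fixed set of blocks sweeps the whole input.
for (unsigned int i = blockIdx.x*blockDim.x + t; i < N; i += gridDim.x*blockDim.x)
    sum += input[i];
partialSum[t] = sum;
// One tree reduction per block, only after the data is exhausted.
for (unsigned int stride = blockDim.x/2; stride >= 1; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
if (t == 0)
    output[blockIdx.x] = partialSum[0];   // e.g., 16 values total for 16 blocks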

41
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Caveat
I didn’t try these ideas.

I’ll leave them


• for those of you who feel motivated
• to try in Lab 5.
Do
• save a copy of your simpler solution, though, as
• you will need the partial sums for scan (next Lab).

42
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
ANY MORE QUESTIONS?
READ CHAPTER 5
43
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Problem Solving
• Q: Consider a 3D video filtering (convolution) code in CUDA with
a 5x5x7 mask, which is stored in constant memory. Shared
memory is used to fully store the input tile required for a
16x16x16 output tile (for example, using strategy 2). What is the
ratio of total global memory read operations to shared memory
accesses for one output tile? For this question, only consider
interior tiles with no ghost elements.
• A:
– 20*20*22 to 5*5*7*16*16*16
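– Worked out: the input tile is (16+5−1) × (16+5−1) × (16+7−1) = 20 × 20 × 22 = 8,800
global memory reads, while the shared-memory accesses are 5 × 5 × 7 × 16 × 16 × 16 =
175 × 4,096 = 716,800, so the ratio is 8,800 : 716,800, roughly 1 : 81.5.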

© Volodymyr Kindratenko, 2024 ECE408/CS483, 44


University of Illinois, Urbana-Champaign
Problem Solving
• Q: Consider a convolutional neural network that takes 100x200 images with 3 color
channels (red, green, blue). The first layer of this network generates 10 output feature
maps using 9x9 filters, where all channels are combined in each output feature map.
Assuming all convolutions are performed in floating point, and considering only the
convolutional layer (e.g., no pooling, thresholding, non-linearity, etc.), how many floating-
point operations (both multiplications and additions) are required to generate all the
output feature maps in a single forward pass? Remember: output feature maps are smaller
than input maps because only pixels without ghost elements are generated.
• A:
10 (output feature maps)
* 92*192 (total # of pixels to compute in each output feature map)
* 9*9 (conv. filter size)
* 3 (color channels)
* 2 (multiply-add operations)
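Worked out: each output map is (100−9+1) × (200−9+1) = 92 × 192 = 17,664 pixels;
each pixel needs 9 × 9 × 3 = 243 multiply-add pairs, i.e. 486 floating-point
operations; so the total is 10 × 17,664 × 486 = 85,847,040 ≈ 8.6 × 10^7 FLOPs.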

© Volodymyr Kindratenko, 2024 ECE408/CS483, 45


University of Illinois, Urbana-Champaign
