Ece408 Lecture13 Reduction Tree VK FL24
Lecture 13
Parallel Computation Patterns –
Reduction Trees
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• Project Milestone 1
– Baseline CPU/GPU implementation is due this Friday
• Lab 5
– Will be released later this week
• Lectures
– No lecture next Tuesday Oct. 15th due to the midterm exam
– Lecture for Thursday Oct. 17th will be recorded and posted on-line for you to
watch on your own time
• Lab grading
– Please check all your lab (1-4) grades and make sure they are correct and correctly
recorded.
Midterm 1
• Midterm 1 on October 15th at 7pm in ECEB, see Canvas for details
– Handwritten notes are NOT permitted; we will provide a reference sheet, which
will also be posted on Canvas prior to the exam
– Exam will include multiple-choice, written-response, fill-in-the-blank, and coding questions
– Covers Lectures 1-12, the "Application Profiling with Nsight" lecture, Labs 1-4, and PM1
– Paper-based; see examples from prior semesters, particularly SP24
– HKN review session: 3-5:30pm on Sunday in 1002 ECEB
A note about tools
• Kernel Profiling
– Old name: nv-nsight-cu-cli (mentioned in the lecture)
– New name: ncu (used in the project)
– Both commands are simply aliases and function identically, with the same usage
and behavior.
• Memory Checking
– Old name: cuda-memcheck (mentioned in both the lecture and the project)
– New name: compute-sanitizer
– cuda-memcheck and compute-sanitizer are distinct tools. The newer
compute-sanitizer integrates additional debugging features beyond memory checking.
However, when it comes to detecting memory errors, compute-sanitizer can serve
as a drop-in replacement for cuda-memcheck.
Objectives
• To learn the basic concept of reductions, one of the most
widely used parallel computation patterns
• To learn simple strategies for parallelization of reductions
• To understand the performance issues involved with
performing reductions on GPUs
Important Enough to Use in Theory
“… scan operations, also known as prefix computations,
can execute in no more time than …
parallel memory references …
greatly simplify the description of many [parallel]
algorithms, and are
significantly easier to implement than
memory references.” —Guy Blelloch, 1989*
© Steven S. Lumetta, 2020
ECE408/CS483/ University of Illinois at Urbana-Champaign
Trying to Bridge Theory and Practice
A generic parallel algorithm,
• in which parallel threads access memory arbitrarily,
• is likely to produce an extremely slow access pattern.
Scans
• can be implemented quickly in hardware, and
• form a useful alternative to arbitrary memory accesses.
(His hope was to enable theory
without knowledge of microarchitecture.)
Example Use: Summarizing Results
1. Start with a large set of things (examples: integers, social
networking user information)
2. Process each thing independently to produce some value
(examples: number of friends, timeline posts in last two weeks)
3. Summarize!
– Typically, with an associative
and commutative operation (+, *, min, max, …)
– since things in the set are
unordered and independent.
Focus on Reduction Using a Tree
Pattern is so common that
• people have built frameworks around it!
• examples: Google and Hadoop MapReduce
Let’s focus on the summarization, called a reduction:
– no required order for processing the values (operator is associative
and commutative), so
– partition the data set into smaller chunks,
– have each thread process a chunk, and
– use a tree to compute the final answer.
Reduction Enables Parallelization
Reduction enables common parallel transformations.
What Exactly is a Reduction?
Reduce a set of inputs to a single value
• using a binary operator, such as
– sum, product, minimum, maximum,
• or a user-defined reduction operation
– must be associative and commutative
– and have an identity value (example: 0 for sum)
Sequential Reduction is O(N)
Given a binary operator ⊕ and an identity value I⊕
• I⊕ = 0 for sum
• I⊕ = 1 for product
• I⊕ = largest possible value for min
• I⊕ = smallest possible value for max

result ← I⊕
for each value X in input
    result ← result ⊕ X
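For concreteness, a minimal C sketch of this sequential loop for a sum reduction (the function and variable names are illustrative, not from the slides):

float sequential_sum(const float *input, unsigned int N)
{
    float result = 0.0f;              // identity value for sum
    for (unsigned int i = 0; i < N; ++i)
        result = result + input[i];   // result <- result (+) input[i]
    return result;
}

One pass over the N inputs: O(N) work and no parallelism.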
Example: Parallel Max Reduction in log(N) Steps
Input:   3   1   7   0   4   1   6   3
Step 1:    3       7       4       6      (pairwise max)
Step 2:        7               6
Step 3:                7
Tournaments Use Reduction with “max”
Algorithm is Work Efficient
For N input values, the number of operations is
(1/2)N + (1/4)N + (1/8)N + ⋯ + (1/N)N = (1 − 1/N)N = N − 1.
Fast if Enough Resources are Available
For N input values, the number of steps is log(N).
With enough execution resources,
• N=1,000,000 takes 20 steps!
• Sounds great!
How much parallelism do we need?
• On average, (N-1)/log(N).
50,000 in our example.
• But peak is N/2!
500,000 in our example.
Diminishing Parallelism is Common
In our parallel reduction,
• the number of operations
• halves in every step.
This kind of narrowing parallelism is common
• from combinational logic circuits
• to basic blocks
• to high-performance applications.
Parallel Strategy for CUDA
Let’s start simple: N values in device global memory.
Each thread block of M threads
• uses shared memory,
• to reduce a chunk of 2M values to one value
• (2M << N to produce enough thread blocks).
Blocks operate within shared memory
• to reduce global memory traffic, and
• write one value back to global memory.
CUDA Reduction Algorithm
1. Read block of 2M values into shared memory.
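A minimal sketch of this load step, assuming the kernel receives the input array in global memory and uses BLOCK_SIZE threads per block (these names are assumptions, not code from the slides):

__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t     = threadIdx.x;
unsigned int start = 2 * blockIdx.x * blockDim.x;   // first element of this block's 2M-value chunk

// Thread t loads its two adjacent elements, 2t and 2t+1 (the mapping on the next slide).
partialSum[2*t]     = input[start + 2*t];
partialSum[2*t + 1] = input[start + 2*t + 1];

The __syncthreads() at the start of each reduction step (shown a few slides later) guarantees these loads complete before any thread starts adding.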
A Simple Mapping of Data to Threads
Each thread
• begins with two adjacent locations (stride of 1),
• an even index (first) and an odd index (second).
• Thread 0 gets 0 and 1, Thread 1 gets 2 and 3, …
• Write result back to the even index.
After each step,
• half of the active threads are done.
• Double the stride.
At the end, result is at index 0.
Naïve Data Mapping for a Reduction
[Figure: naïve mapping. Thread 0 starts with inputs 0 and 1, Thread 1 with inputs 2 and 3, and so on; after step 3, index 0 holds the partial result for elements 0..7 and index 8 holds the partial result for elements 8..15.]
A Sum Example (Values Instead of Indices)
Input:   3   1   7   0   4   1   6   3
Step 1:  4       7       5       9      (active partial-sum elements)
Step 2: 11              14
Step 3: 25
The Reduction Steps
// Stride is the distance to the next value being
// accumulated into this thread's mapped position
// in the partialSum[] array.
for (unsigned int stride = 1;
     stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t + stride];
}
Why do we need __syncthreads()?
Barrier Synchronization
__syncthreads() ensures
• all elements of the partial sum are generated
• before the next step uses them.
Example Without __syncthreads
Input:   3   1   7   0   4   1   6   3
Step 1:  4       7       5      (9 not done in time)
Step 2: 11              11      (stale value read instead of 14)
Step 3: 22                      (incorrect partial sum; should be 25)
Several Options after Blocks are Done
After all reduction steps, thread 0
• writes block’s sum from partialSum[0]
• into global vector indexed by blockIdx.x.
Vector has length N/(2M).
• If small, transfer vector to host
and sum it up on CPU.
• If large, launch kernel again (and again).
(Kernel can also accumulate to a global sum using atomic
operations, to be covered soon.)
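A sketch of the writeback plus the "sum it up on CPU" option; output, numBlocks, and the host-side names are assumptions for illustration, not code from the slides:

// End of the kernel: thread 0 writes the block's sum to global memory.
if (t == 0)
    output[blockIdx.x] = partialSum[0];

// Host side, when the per-block vector is small:
cudaMemcpy(h_partial, d_output, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
float total = 0.0f;
for (unsigned int b = 0; b < numBlocks; ++b)
    total += h_partial[b];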
“Segmented Reduction”
[Figure: a list of arbitrary length in global memory is split into segments; Block 0 through Block n-1 each reduce one segment into their own partialSum.]
Analysis of Execution Resources
All threads active in the first step.
In all subsequent steps, two control flow paths:
• perform addition, or do nothing.
• Doing nothing still consumes execution resources.
At most half of the threads perform an addition after the first step
• (all threads with odd indices are disabled after the first step).
• After the fifth step, entire warps do nothing:
poor resource utilization, though an idle warp has no divergence.
• Active warps have only one active thread.
Up to five more steps (if limited to 1024 threads).
Improve Performance by Reassigning Data
Can we do better?
Absolutely!
A Better Strategy
Illustration with 16 Threads
[Figure: with 16 threads and 32 inputs, step 1 pairs element t with element t+16 (Thread 0 adds elements 0 and 16, ..., Thread 15 adds elements 15 and 31); the stride halves in each later step.]
A Better Reduction Kernel
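The kernel on this slide was shown as an image; a sketch of the stride-halving loop it describes, consistent with the strategy on the previous slides, might look like:

// partialSum[] again holds the block's 2*blockDim.x values; t = threadIdx.x as before.
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];   // consecutive threads stay active
}
// The block's result ends up in partialSum[0].

Because the active threads always form a consecutive block starting at 0, whole warps retire together, which is exactly the behavior analyzed on the next slide.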
Again: Analysis of Execution Resources
Given 1024 threads,
• Block loads 2048 elements to shared memory.
• No branch divergence in the first six steps:
– 1024, 512, 256, 128, 64, and 32
consecutive threads active;
– threads in each warp either
all active or all inactive
• Last six steps have one active warp
(branch divergence for last five steps).
Parallel Algorithm Overhead
__shared__ float partialSum[2*BLOCK_SIZE];
Parallel Execution Overhead
[Figure: the eight-element sum-reduction tree, annotated with the add operations at each level.]
Although the number of "operations" is N, each "operation"
involves much more complex address calculation and
intermediate result manipulation.
Further Improvements
Can we further improve reduction?
Make Use of Shared Memory
How much shared memory are we using?
Each block of 1,024 threads reads 2,048 values.
• Let’s say two blocks per SM,
• so 16 kB ( = 2,048 × 2 × 4B ).
Could read 4,096 or 8,192 values
• (with 64 kB per SM)
• to slightly increase parallelism.
(For 48 kB per SM, use 6,144 values and have all threads do a 3-to-1
reduction before the current loop.)
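A rough sketch of that 3-to-1 pre-reduction, assuming partialSum[] is enlarged to hold the block's 6,144 values (indices and names are illustrative, not from the slides):

__shared__ float partialSum[6*BLOCK_SIZE];          // 24 kB per block for BLOCK_SIZE = 1024

unsigned int t     = threadIdx.x;
unsigned int start = 6 * blockIdx.x * blockDim.x;
for (unsigned int k = 0; k < 6; ++k)                // each thread loads six elements
    partialSum[k*blockDim.x + t] = input[start + k*blockDim.x + t];

// 3-to-1: fold the upper 4*blockDim.x values into the lower 2*blockDim.x.
// Each thread folds only values it loaded itself, so no extra barrier is needed here.
partialSum[t]              += partialSum[2*blockDim.x + t] + partialSum[4*blockDim.x + t];
partialSum[blockDim.x + t] += partialSum[3*blockDim.x + t] + partialSum[5*blockDim.x + t];
// ...then run the existing reduction loop over the first 2*blockDim.x entries.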
Eliminate the Narrow Parallelism
What about parallelism?
Get Rid of the Overhead
Launching kernels is expensive.
• Why bother tearing down and setting up the same blocks on the same
SMs?
• Makes no sense.
• Remember that reduction operators are associative and commutative.
Let’s be compute-centric:
• put 2048 threads (as two blocks) on each SM, and
• just keep them there until we’re done!
Work Until the Data is Exhausted!
Say there are 8 SMs, so 16 blocks.
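One way to realize these persistent blocks is a grid-stride loop: each thread accumulates a private sum over its share of the whole input, followed by a single in-block tree reduction and one atomic add per block. The kernel below is a sketch under those assumptions, not code from the slides:

#define BLOCK_SIZE 1024

__global__ void persistentReduce(const float *input, float *total, unsigned int N)
{
    __shared__ float partialSum[BLOCK_SIZE];
    unsigned int t = threadIdx.x;

    // Phase 1: stride through the entire input with a fixed number of blocks.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + t; i < N;
         i += gridDim.x * blockDim.x)
        sum += input[i];
    partialSum[t] = sum;

    // Phase 2: one tree reduction per block, exactly as before.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Phase 3: one value per block, accumulated with the atomics mentioned earlier.
    if (t == 0)
        atomicAdd(total, partialSum[0]);
}

// Launched once with a fixed grid, e.g. 16 blocks for the 8-SM example:
//   persistentReduce<<<16, BLOCK_SIZE>>>(d_input, d_total, N);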
Caveat
I didn’t try these ideas.
ANY MORE QUESTIONS?
READ CHAPTER 5
Problem Solving
• Q: Consider a 3D video filtering (convolution) code in CUDA with
a 5x5x7 mask, which is stored in constant memory. Shared
memory is used to fully store the input tile required for a
16x16x16 output tile (for example, using strategy 2). What is the
ratio of total global memory read operations to shared memory
accesses for one output tile? For this question, only consider
interior tiles with no ghost elements.
• A:
– The input tile for a 16x16x16 output tile is (16+5-1) x (16+5-1) x (16+7-1) = 20x20x22;
each of its elements is read from global memory once, giving 20*20*22 = 8,800 reads.
– Each of the 16*16*16 = 4,096 output elements reads 5*5*7 = 175 values from shared
memory, giving 5*5*7*16*16*16 = 716,800 shared memory accesses.
– Ratio: 20*20*22 to 5*5*7*16*16*16, i.e., 8,800 to 716,800 (about 1 to 81.5).