ECE 408/CS 483 MIDTERM 2 REVIEW
Electrical & Computer Engineering
Fall 2024
HRISHI SHAH
12/08/2024
EXAM DETAILS
• Tuesday, December 10th, 2024, from 7:00 PM to 10:00 PM
• Closed book exam and no cheat sheets allowed
• Coverage:
o All lectures, including guest lectures
o All labs
o All project milestones
o Focused on (but not limited to!) material not in
Exam 1
HOW TO PREPARE
• Reread lectures and write a few takeaways (also attempt end of lecture questions)
• Read through textbook for more difficult topics
• NVIDIA Documentation for basic parallel programming ideas
• Check your mistakes on previous labs and quizzes
• Make sure to understand your code, even if it’s from lecture
• Past exams (several provided in files section of Canvas)
• Make your own review sheet (can’t use on exam)
• Review MT1 closely
KEY TOPICS
• Reduction
• Scan
• Histograms
• Shared Memory and Synchronization
• Sparse Matrix Multiplication
• Project (Streaming!!)
• GPU Architecture
• Scalable Computing
• Accelerator APIs and Tensor Cores
REDUCTION
REDUCTION TREES
How can we reduce a large set of data to a
single value?
We want operations that can be applied to
elements independently of one another (so the
work can be parallelized).
The operators must therefore be commutative (1)
and associative (2), meaning they produce the
same result regardless of
1 - ordering
2 - grouping
But what benefit does parallel reduction have?
NAÏVE REDUCTION
• 8 Elements
• 4 Threads
• Stride starts at 1, and multiplies by 2 at
each iteration
• When does stride stop?
• What are some problems with this?
NAÏVE REDUCTION PROBLEM
As with using shared memory for matrix
multiplication, convolution, etc., we need to
ensure that every thread has completed and
stored its value before proceeding.
We need __syncthreads(). Can this slow down our
kernel?
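For reference, a minimal sketch of the naïve reduction kernel in the style discussed in lecture, with the __syncthreads() in question; BLOCK_SIZE, the array names, and the assumption that the input length is a multiple of the section size (2 * blockDim.x) are illustrative.

#define BLOCK_SIZE 256                       // illustrative block size

__global__ void naive_sum_reduction(const float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    // Each thread loads two elements of its block's section into shared memory
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    // Stride starts at 1 and doubles each iteration; active threads are spread apart,
    // so warps suffer heavy control divergence
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();                     // all partial sums of the previous step must be stored
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];   // one partial sum per block
}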
BETTER REDUCTION
Compact the partial sums and store them in
contiguous regions of memory.
“Turn off” the other half of the active threads at
each iteration.
Why is this better when the amount of work is the
same?
Compare control divergence between the
two implementations
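A minimal sketch of the improved (convergent) reduction, under the same assumptions as the naïve sketch above: active threads stay contiguous, so whole warps retire early and control divergence is confined to the final iterations.

__global__ void convergent_sum_reduction(const float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    // Stride starts at blockDim.x and halves each iteration; partial sums stay
    // compacted in the low end of shared memory
    for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)                      // "turn off" the upper half of the threads
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];
}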
REDUCTION ANALYSIS
Let’s say… we’re performing a sum reduction kernel.
Sequential code:
• Number of steps: O(N)
• Number of operations: N-1
Parallel code:
• Number of steps: O(log(N))
• Number of operations: N-1
Both algorithms are work efficient! (N-1 < N operations)
So.. parallel reduction is always faster than sequential code… right? Or is it?
REDUCTION ANALYSIS (cont’d)
There are log(N) steps and N-1 operations for parallel reduction.
Per step, on average:
(N-1)/log(N) operations, compared to just 1 operation per step if we were running this
sequentially. If we have no hardware limitation, then parallel reduction is obviously
faster.
Are there any downsides?
How many threads are “turned off” per iteration?
What happens to the number of computations after every iteration?
“Narrowing Parallelism” - the number of threads launched stays fixed, but fewer of them do useful work each iteration
PRACTICE PROBLEM
The improved version of the parallel reduction kernel discussed in class is used to reduce
an input array of 15,526 floats. Each thread block in the grid contains 512 threads.
How many Bytes are written to global memory by the grid executing the kernel?
How many times does a single thread block synchronize to reduce its portion of the array
to a single value?
SCAN
WHAT IS A SCAN?
A prefix sum (also known as scan) is a fundamental operation that involves
computing the cumulative sum of a sequence of numbers or elements in an
array. The prefix sum operation generates a new array where each element at
index i contains the sum of all elements up to and including index i in the original
array.
These operations are commutative and associative.
Example:
A = [3 1 7 0 4 1 6 3] → A = [3 4 11 11 15 16 22 25]
KOGGE-STONE ALGORITHM
The number of threads equals the number of
elements in shared memory.
if (tx >= stride) {
    // race condition?!
    XY[tx] += XY[tx - stride];
}
Number of Iterations: log(n)
Number of computations: O(n*log(n))
Is this work efficient?? Think about what
happens if n = 1,000,000 elements
Factor of 20x!
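A minimal sketch of a Kogge-Stone inclusive scan over one section, assuming the section fits in a single block of SECTION_SIZE threads (names are illustrative). The temporary value plus the second __syncthreads() is one way to resolve the race flagged above: every thread reads its neighbor's old value before anyone overwrites it.

#define SECTION_SIZE 1024                    // illustrative: threads per block == elements per section

__global__ void kogge_stone_scan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                     // previous step's results must be in place
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride];
        __syncthreads();                     // all reads finish before any writes
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}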
BRENT-KUNG ALGORITHM
Brent-Kung allows each thread block to
process twice the amount of elements.
(e.g., 16 threads can process 32 elements)
“Balanced Trees” conceptual structure.
Traverse down the tree, then traverse back
up the tree.
Number of iterations: 2 * log(n)
Number of computations: O(n)
Brent-Kung performs fewer computations than
Kogge-Stone, making it work efficient.
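A minimal sketch of the Brent-Kung scan for one section, assuming blockDim.x equals BK_BLOCK_SIZE so each block scans twice that many elements; names are illustrative.

#define BK_BLOCK_SIZE 512                    // illustrative: each block scans 2 * BK_BLOCK_SIZE elements

__global__ void brent_kung_scan(const float *X, float *Y, int n) {
    __shared__ float XY[2 * BK_BLOCK_SIZE];
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;
    XY[threadIdx.x + blockDim.x] = (i + blockDim.x < n) ? X[i + blockDim.x] : 0.0f;
    // Reduction phase: traverse down the balanced tree, building partial sums
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        unsigned int index = (threadIdx.x + 1) * 2 * stride - 1;
        if (index < 2 * blockDim.x)
            XY[index] += XY[index - stride];
    }
    // Post-reduction phase: traverse back up, distributing partial sums
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        unsigned int index = (threadIdx.x + 1) * 2 * stride - 1;
        if (index + stride < 2 * blockDim.x)
            XY[index + stride] += XY[index];
    }
    __syncthreads();
    if (i < n) Y[i] = XY[threadIdx.x];
    if (i + blockDim.x < n) Y[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}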
KOGGE-STONE VS. BRENT KUNG
Brent-Kung uses half the number of threads compared to Kogge-Stone
Each thread should load two elements into shared memory
However... Brent-Kung takes twice the number of iterations/steps
Kogge-Stone is more popular for parallel scan in industry
A COMPLETE HIERARCHICAL SCAN
• If the input data is too
large, we’ll need to
compute partial scans of
each block.
• We repeat this process until
data can finally be scanned
by one block.
• We then perform vector
addition(s) to get our final
array of scanned values!
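A minimal host-side sketch of this hierarchical pattern, assuming hypothetical kernels scan_block() (scan one section in place and record its total) and add_block_sums() (add each section's preceding total back in), and a section size equal to the block size.

#include <cuda_runtime.h>

#define SCAN_SECTION 1024                    // illustrative: elements one block can scan

// Hypothetical kernels (assumed, not from the slides)
__global__ void scan_block(float *data, float *block_sums, int n);
__global__ void add_block_sums(float *data, const float *block_sums, int n);

void hierarchical_scan(float *d_data, int n) {
    int numBlocks = (n + SCAN_SECTION - 1) / SCAN_SECTION;
    float *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(float));
    // 1. Partial scan of each section; each block writes its section total
    scan_block<<<numBlocks, SCAN_SECTION>>>(d_data, d_block_sums, n);
    if (numBlocks > 1) {
        // 2. Scan the block sums, recursing until they fit in a single block
        hierarchical_scan(d_block_sums, numBlocks);
        // 3. Vector addition: fold each section's preceding total into its elements
        add_block_sums<<<numBlocks, SCAN_SECTION>>>(d_data, d_block_sums, n);
    }
    cudaFree(d_block_sums);
}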
PRACTICE PROBLEM
Suppose we need to perform an inclusive scan on an input of 2^42 elements. A
customized GPU supports at most 2^9 blocks per grid and at most 2^12 threads per block.
Using Brent-Kung in a hierarchical fashion, with none of the scan work done by the host,
how many times do we need to launch the Brent-Kung kernel?
PRACTICE PROBLEM
For the Kogge-Stone scan kernel based on reduction trees, assume that we have 1024
elements in each section and warp size is 32, how many warps in each block will have
control divergence during the iteration where stride is 16?
How many warps in each block will have control divergence during the iteration where
stride is 64?
HISTOGRAMS
LET’S LOOK AT DATA RACE CONDITION FIRST..
What if two threads are attempting to write
into or modify the same memory address?
Our kernel cannot function correctly. It might
work sometimes, but we’re creating a scenario
of luck, which is not what we want.
So.. What’s our solution?
ATOMICS
Hardware ensures that no other thread can access a certain memory location
until the atomic operation is complete.
What happens to the other threads that are trying to access?
• Held in queue
All atomic operations are executed serially! But… that surely can’t be efficient..
Note that atomicity does not constrain relative order: whichever thread reaches the
memory address first gets to go first.
WHAT IS HISTOGRAMMING?
A method for extracting notable features and patterns from large datasets
• Feature extraction for object recognition in images
• Fraud detection in credit card transactions
• Correlating heavenly object movements in astrophysics
Basic histograms - for each element in the data set, use the value to identify a
“bin” to increment
A SIMPLE APPROACH…
Threads 1 and 2 have contention in the first iteration - both are updating “M”;
atomicAdd ensures a correct update.
Can we make this better..?
A MORE COALESCED APPROACH
We want to assign inputs to threads in a strided pattern.
Adjacent threads process adjacent input letters! This way we can better utilize
DRAM bursts.
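A minimal sketch of this coalesced (strided) approach using global atomics; the 26-bin lowercase-letter histogram and the array names are illustrative assumptions.

__global__ void histo_kernel(const char *data, unsigned int length, unsigned int *histo) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    // Adjacent threads read adjacent input characters each iteration,
    // so a warp's loads fall into the same DRAM burst
    for (; i < length; i += stride) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo[pos], 1);       // hardware serializes conflicting updates
    }
}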
DISCUSSING LATENCY AND THROUGHPUT ON ATOMICS
Atomics perform a read-modify-write sequence of operations on a given memory location.
Throughput of an atomic operation is the rate at which the application can execute an atomic
operation on a particular location.
The rate is limited by the total latency of the read-modify-write sequence, typically more than
1000 cycles for global memory (DRAM) locations.
This means that if many threads attempt to do atomic operation on the same location
(contention), the memory bandwidth is reduced to < 1/1000!
L2 cache is faster than global memory (DRAM)
Shared memory is even faster
It also allows privatized per-block histogram copies, which greatly reduce contention
But the private copies must be merged into global memory at the end
Any other overheads with privatization?
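A minimal sketch of such a privatized histogram kernel; the initialization loop and the final merge loop are exactly the extra overheads the question above alludes to. NUM_BINS and the letter binning are carried over from the earlier sketch as assumptions.

#define NUM_BINS 26                          // illustrative: one bin per lowercase letter

__global__ void histo_private_kernel(const char *data, unsigned int length, unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    // Initialize the block's private copy (one privatization overhead)
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        histo_s[bin] = 0;
    __syncthreads();
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    for (; i < length; i += stride) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < NUM_BINS)
            atomicAdd(&histo_s[pos], 1);     // fast shared-memory atomics, per-block contention only
    }
    __syncthreads();
    // Merge the private copy into the global histogram (the other overhead)
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        atomicAdd(&histo[bin], histo_s[bin]);
}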
PRACTICE PROBLEM
For a processor that supports atomic operations in L2 cache, assume that each atomic
operation takes 4ns to complete in L2 cache and 100ns to complete in DRAM. Assume that
90% of the atomic operations hit in L2 cache. What is the approximate throughput for
atomic operations on the same global memory variable?
SPARSE MATRICES
WHAT ARE SPARSE MATRICES?
Sparse matrices are matrices that have mostly 0s. These matrices are typically
used in scientific computing, data science, and network analysis.
Compared to dense matrix multiplication, SpMVs are irregular and unstructured..
• Have little input data reuse
If we expressed sparse matrices as regular matrices, we would waste a lot of
space
• Limited scalability
SPARSE MATRIX VECTOR MULTIPLICATION
We are trying to
multiply the matrix A by
a vector X.
Sparse matrices are
missing some elements
in a row.
We need the column index of
each stored value to know which
element of X to multiply it with.
COMPRESSED SPARSE ROW FORMAT (CSR)
The standard sparse matrix representation:
Data array of the non-zero values
Column indices
“Compressed” row pointers
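A minimal sketch of SpMV over CSR with one thread per row; the array names (data, col_index, row_ptr) follow the usual convention and are assumptions here.

__global__ void spmv_csr(int num_rows, const float *data, const int *col_index,
                         const int *row_ptr, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        // Row `row` owns the non-zeros in [row_ptr[row], row_ptr[row + 1])
        for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++)
            dot += data[i] * x[col_index[i]];
        y[row] = dot;
    }
}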
ISSUES REGARDING SPMV WITH CSR
Right off the bat, what are some issues we see..?
Control divergence?
Threads execute different numbers of iterations
Rows have different numbers of non-zero values
Uncoalesced access..
Each thread doesn’t access adjacent memory locations on each iteration
ELL/PACK FORMAT TRIES TO FIX THIS…
What if we tried to pad the rows
and transpose it? What would this
look like?
Zero-pad every row to the length of the
longest row in the sparse matrix
Transpose the zero-padded matrix
for DRAM efficiency
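A minimal sketch of SpMV over ELL, assuming data and col_index are zero-padded to max_nnz_per_row entries per row and stored transposed (column-major) so that consecutive threads read consecutive addresses; names are illustrative.

__global__ void spmv_ell(int num_rows, int max_nnz_per_row, const float *data,
                         const int *col_index, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int t = 0; t < max_nnz_per_row; t++) {
            int idx = t * num_rows + row;         // transposed (column-major) indexing
            dot += data[idx] * x[col_index[idx]]; // zero-padded entries contribute 0
        }
        y[row] = dot;
    }
}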
COO FORMAT – GOOD ‘OLE TRUSTY
List column and row indexes for every non-zero value in the sparse matrix! This
allows for easy re-ordering.
CAN WE COMBINE ELL AND COO?
What if we had a matrix that had mostly 1~2 elements per row, but some rows
randomly had a lot more non-zero elements than majority?
ELL format for the majority of the rows
COO format for the rows that are outliers
PRACTICE PROBLEM
Convert the following matrix into CSR representation and COO representation:
JAGGED DIAGONAL SPARSE (JDS)
What if we took the CSR format but optimized it
further? What are some problems with the CSR format?
The number of elements per row varies unpredictably
SpMV can cause a large amount of control divergence
if the number of iterations per thread varies a lot
…which means block performance is determined by
the longest row in the block itself
• Uncoalesced access pattern of input data
Solution: Sort the matrix by the number of elements
in a row in descending order. We can also transpose
the matrix to access elements in a coalesced manner.
JDS REGULARIZATION
CSR TO JDS CONVERSION
Row pointers must change,
and the newly arranged rows
must keep pointers back to the
original rows they came from.
jds_row_ptr: similar to the
original CSR row_ptr
jds_row_perm: maps each JDS
row back to its original row
index
BUT WHAT ABOUT MEMORY COALESCING?
With the rows sorted in
descending order, threads now
handle similar numbers of
non-zero values.
What about an efficient data
access pattern? Solution:
transpose our matrix
Thread 0 takes: 2, 4, 1
Thread 1 takes: 3, 1
Thread 3 takes: 1, 1
This is very similar to lab 2 & 3!
PRACTICE PROBLEM
What type of sparse matrix format should we use if our data distribution is….
Roughly random?
High variance in rows? (some really long and some really short)
Super sparse?
Roughly triangular?
PROJECT RECAP
CONVOLUTION RECAP
INPUT MATRIX UNROLLING
The central idea is unfolding and replicating
the inputs to the convolutional kernel such
that all elements needed to compute one
output element will be stored as one
sequential column.
Note that the threads would traverse from
top to bottom, thus accessing the data in a
coalesced pattern.
Example on the right:
1 full image
3 channels
2x2 mask
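A minimal sketch of such an unrolling ("im2col") kernel for one image, assuming a valid (no-padding) convolution; the function and parameter names are illustrative rather than the exact project code.

__global__ void unroll_kernel(const float *X, float *X_unroll,
                              int C, int H, int W, int K) {
    int H_out = H - K + 1;
    int W_out = W - K + 1;
    int W_unroll = H_out * W_out;            // number of output elements = columns of X_unroll
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < C * W_unroll) {
        int c = t / W_unroll;                // which input channel this thread copies from
        int col = t % W_unroll;              // which output element (column) this thread fills
        int h_out = col / W_out;
        int w_out = col % W_out;
        int row_base = c * K * K;            // starting row of this channel's section
        // Adjacent threads fill adjacent columns of each unrolled row, so stores are coalesced
        for (int p = 0; p < K; p++)
            for (int q = 0; q < K; q++) {
                int row = row_base + p * K + q;
                X_unroll[row * W_unroll + col] =
                    X[c * H * W + (h_out + p) * W + (w_out + q)];
            }
    }
}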
STREAMS
Stream: a sequence of operations that execute in order. Streams all individually
do work and can be utilized to speed up overall kernel execution when work can
be divided up independently.
[Timeline diagram: with a single stream, H2D copies and kernel launches (K<<<>>>) execute back-to-back; with Stream 1 and Stream 2, copies and kernels from different streams overlap in time.]
Note that streams are performing
asynchronous memory transfers. But memory
transfers are expensive and have a lot of
overhead. How do we solve this issue?
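As a concrete reference for the stream pattern itself (the transfer-overhead question is addressed on the next slides), here is a minimal sketch of dividing work across two streams so copies and kernels from different segments can overlap; myKernel, SEG_SIZE, and the buffer names are illustrative, and the host buffers are assumed to be pinned.

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int len);   // hypothetical kernel

void launch_with_streams(float *h_in, float *h_out, float *d_in, float *d_out, int n) {
    const int SEG_SIZE = 1 << 20;            // elements per segment (assumption)
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);
    for (int i = 0; i < n; i += SEG_SIZE) {
        int s = (i / SEG_SIZE) % 2;          // alternate segments between the two streams
        int len = (n - i < SEG_SIZE) ? (n - i) : SEG_SIZE;
        // Everything below is asynchronous: work in different streams may overlap,
        // while work within one stream executes in order
        cudaMemcpyAsync(d_in + i, h_in + i, len * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(len + 255) / 256, 256, 0, stream[s]>>>(d_in + i, d_out + i, len);
        cudaMemcpyAsync(h_out + i, d_out + i, len * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < 2; s++) {
        cudaStreamSynchronize(stream[s]);    // wait for everything queued in this stream
        cudaStreamDestroy(stream[s]);
    }
}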
LET’S TALK ABOUT DIRECT MEMORY ACCESS
CUDA GPUs typically use Direct Memory Access (DMA) for memory transfers
between host and device. But what is DMA?
DMA utilizes the full bandwidth of an I/O bus
Uses physical address for source and destination
Allows memory transfers without involving CPU
Direct transfer of data between main memory and the GPU
Requires pinned memory… but why?
When you use CUDA APIs such as cudaMemcpy to transfer data between the
host and device, CUDA internally uses DMA engines to perform these transfers.
These DMA engines are specialized hardware units within the GPU that are
optimized for moving data between different memory spaces (e.g., host memory
and device memory) efficiently.
PAGEABLE VS PINNED MEMORY
When data is in pageable memory, it may be resident in physical memory, or it may be swapped out to a secondary
store (i.e., disk/SSD). Let’s walk through a scenario..
Scenario: CUDA wants to perform a memory copy from host to device
The CUDA runtime library detects that the required data is not in physical memory (DRAM), which triggers a page fault.
Handling a page fault requires the CPU to intervene, bringing the data from disk into physical memory (page-in).
First overhead
Disk I/O operations take up a lot of time!
Once the data is in physical memory, CUDA asks the CPU to copy it into pinned memory. Data in pinned memory cannot be
paged out to disk until it is officially unpinned.
Second overhead
If the buffer were not pinned, the OS could page out data that is being read or written by a DMA.
After pinning, the memory transfer from host to device can finally happen. CUDA then asks the host to release the pinned memory.
Minimizing these overheads is crucial in memory management. For asynchronous memory transfers using
streams, the required data must be pinned. Data transfers from already-pinned memory run around 2x faster.
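A minimal sketch of allocating pinned host memory up front, so that cudaMemcpyAsync can DMA directly without the page-in and staging-copy overheads; N is an illustrative size.

#include <cuda_runtime.h>

void pinned_buffer_example() {
    const size_t N = 1 << 20;                    // illustrative element count
    float *h_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));   // pinned (page-locked) allocation instead of pageable malloc()
    // ... fill h_buf and issue cudaMemcpyAsync(..., stream) calls as in the stream sketch ...
    cudaFreeHost(h_buf);                         // unpin and release when done
}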
GPU ARCHITECTURE
WHAT IS PCI?
Peripheral Component Interconnect (PCI)
• Originally 33 MHz, 32-bit wide, later doubled
• Upstream bandwidth was still slow (~256 MB/s peak), and
the GPU had to arbitrate for the bus in the original
GPU-in-PC architecture
BUT it was cool due to its use of memory-mapped I/O.
• Addresses are assigned to PCI devices at boot time, and
the devices listen for their respective address
• Regular loads and stores can be used to communicate
with PCI devices
PCIe CONFIG
Each generation of PCIe aims to double the bandwidth and improve power efficiency.
The original PCIe ran at 2.5 GT/s per lane; the latest PCIe 6.0 runs at 64 GT/s.
Within a generation, the number of lanes in a link can also be scaled to provide more
distinct physical channels
• x1, x2, x4, x8, x16, x32, ….
SCALABLE COMPUTING
SCALABLE COMPUTING AND DATACENTERS
Vikram’s lecture but super fast and a lot less detailed
• Number of operations is outpacing compute scaling
• AI models are continuing to grow to more and more parameters to avoid
overtraining, and they’re already good at what they do
• Power, latency, data scarcity, chip production are factors
• Data parallelism, pipeline parallelism, and model parallelism help break down
these massive models to sizable units sent to different GPUs during training
• Inference has its own problems that harm performance
TENSOR OPERATIONS
TENSOR OPERATIONS
A lot of parallel programming applications and AI models are just many matrix
multiplications one after another.
Matrix multiplication and accumulate is our limiting factor; if we speed it up,
we get huge gains, even from an improvement of only a fraction of a percent.
If we must matrix multiply so much, why not just have specialized hardware for the
role?
At the thread level, for tiled matrix multiplication, you must loop through the tile
and calculate a partial dot product. But if the tile width is fixed, the hardware can
specialize to perform this task.
TENSOR CORES
For a 4x4 tile example, this is what the naïve hardware would look like:
But you already know the register access pattern as well; what if we significantly
simplify the loads from shared memory to the register file?
TENSOR CORES OPTIMIZED
Each thread only loads 2 values from shared memory to the register file, and then
the hardware’s routing network routes the values to computational units.
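For context, a minimal sketch of how CUDA exposes Tensor Cores through the WMMA API: one warp cooperatively loads 16x16 tiles into fragments and issues a single matrix-multiply-accumulate. The 16x16x16 shape, row-major layouts, and leading dimensions are illustrative assumptions, not the exact configuration from lecture.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B with FP16 inputs and FP32 accumulation
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);                 // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);             // 16 = leading dimension of each tile
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // one Tensor Core matrix-multiply-accumulate per warp
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}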