ECE 408/CS 483 MIDTERM 2 REVIEW
Electrical & Computer Engineering
Fall 2024
HRISHI SHAH
12/08/2024
EXAM DETAILS
• Tuesday, December 10th, 2024, from 7:00 PM to 10:00 PM
• Closed book exam and no cheat sheets allowed
• Coverage:
o All lectures, including guest lectures
o All labs
o All project milestones
o Focused on (but not limited to!) material not in
Exam 1
HOW TO PREPARE
• Reread lectures and write a few takeaways (also attempt end of lecture questions)
• Read through textbook for more difficult topics
• NVIDIA Documentation for basic parallel programming ideas
• Check your mistakes on previous labs and quizzes
• Make sure to understand your code, even if it’s from lecture
• Past exams (several provided in files section of Canvas)
• Make your own review sheet (can’t use on exam)
• Review MT1 closely
KEY TOPICS
• Reduction
• Scan
• Histograms
• Shared Memory and Synchronization
• Sparse Matrix Multiplication
• Project (Streaming!!)
• GPU Architecture
• Scalable Computing
• Accelerator APIs and Tensor Cores
REDUCTION
REDUCTION TREES
How can we reduce a large set of data to a
single value?
We want operations that can be applied to
elements independently of one another (so the
work can be parallelized).
The operators must therefore be commutative (1)
and associative (2), meaning they produce the
same result regardless of
1 - ordering
2 - grouping
But what benefit does parallel reduction have?
NAÏVE REDUCTION
• 8 Elements
• 4 Threads
• Stride starts at 1, and multiplies by 2 at
each iteration
• When does stride stop?
• What are some problems with this?
NAÏVE REDUCTION PROBLEM
As with using shared memory for matrix
multiplication, convolution, etc., we need to
ensure that every thread has completed and
stored its value before proceeding.
We need __syncthreads(). Can this slow down our
kernel?
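For reference, a minimal sketch of the naïve reduction kernel in the style discussed in lecture, with the __syncthreads() in question; BLOCK_SIZE, the array names, and the assumption that the input length is a multiple of the section size (2 * blockDim.x) are illustrative.

#define BLOCK_SIZE 256                       // illustrative block size

__global__ void naive_sum_reduction(const float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    // Each thread loads two elements of its block's section into shared memory
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    // Stride starts at 1 and doubles each iteration; active threads are spread apart,
    // so warps suffer heavy control divergence
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();                     // all partial sums of the previous step must be stored
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];   // one partial sum per block
}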
BETTER REDUCTION
Compact the partial sums and store them in
contiguous regions of memory.
“Turn off” the other half of the active threads at
each iteration.
Why is this better when the amount of work is the
same?
Compare control divergence between the
two implementations
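A minimal sketch of the improved (convergent) reduction, under the same assumptions as the naïve sketch above: active threads stay contiguous, so whole warps retire early and control divergence is confined to the final iterations.

__global__ void convergent_sum_reduction(const float *input, float *output) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned int t = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    // Stride starts at blockDim.x and halves each iteration; partial sums stay
    // compacted in the low end of shared memory
    for (unsigned int stride = blockDim.x; stride >= 1; stride >>= 1) {
        __syncthreads();
        if (t < stride)                      // "turn off" the upper half of the threads
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];
}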
REDUCTION ANALYSIS
Let’s say… we’re performing a sum reduction kernel.
Sequential code:
• Number of steps: O(N)
• Number of operations: N-1
Parallel code:
• Number of steps: O(log(N))
• Number of operations: N-1
Both algorithms are work efficient! (N-1 < N operations)
So.. parallel reduction is always faster than sequential code… right? Or is it?
REDUCTION ANALYSIS (cont’d)
There are log(N) steps and N-1 operations for parallel reduction.
Per step, on average:
(N-1)/log(N) operations, compared to just 1 operation per step if we were running this
sequentially. If we have no hardware limitation, then parallel reduction is obviously
faster.
Are there any downsides?
How many threads are “turned off” per iteration?
What happens to the number of computations after every iteration?
“Narrowing Parallelism” - the number of threads launched stays fixed, but fewer of them do useful work each iteration
PRACTICE PROBLEM
The improved version of the parallel reduction kernel discussed in class is used to reduce
an input array of 15,526 floats. Each thread block in the grid contains 512 threads.
How many Bytes are written to global memory by the grid executing the kernel?
How many times does a single thread block synchronize to reduce its portion of the array
to a single value?
SCAN
WHAT IS A SCAN?
A prefix sum (also known as scan) is a fundamental operation that involves
computing the cumulative sum of a sequence of numbers or elements in an
array. The prefix sum operation generates a new array where each element at
index i contains the sum of all elements up to and including index i in the original
array.
These operations are commutative and associative.
Example:
A = [3 1 7 0 4 1 6 3] → A = [3 4 11 11 15 16 22 25]
KOGGE-STONE ALGORITHM
The number of threads equals the number of
elements in shared memory.
if (tx >= stride) {
    // race condition?!
    XY[tx] += XY[tx - stride];
}
Number of Iterations: log(n)
Number of computations: O(n*log(n))
Is this work efficient?? Think about what
happens if n = 1,000,000 elements
Factor of 20x!
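A minimal sketch of a Kogge-Stone inclusive scan over one section, assuming the section fits in a single block of SECTION_SIZE threads (names are illustrative). The temporary value plus the second __syncthreads() is one way to resolve the race flagged above: every thread reads its neighbor's old value before anyone overwrites it.

#define SECTION_SIZE 1024                    // illustrative: threads per block == elements per section

__global__ void kogge_stone_scan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                     // previous step's results must be in place
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride];
        __syncthreads();                     // all reads finish before any writes
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}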
BRENT-KUNG ALGORITHM
Brent-Kung allows each thread block to
process twice the amount of elements.
(e.g., 16 threads can process 32 elements)
“Balanced Trees” conceptual structure.
Traverse down the tree, then traverse back
up the tree.
Number of iterations: 2 * log(n)
Number of computations: O(n)
Brent-Kung performs fewer computations than
Kogge-Stone, making it work efficient.
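A minimal sketch of the Brent-Kung scan for one section, assuming blockDim.x equals BK_BLOCK_SIZE so each block scans twice that many elements; names are illustrative.

#define BK_BLOCK_SIZE 512                    // illustrative: each block scans 2 * BK_BLOCK_SIZE elements

__global__ void brent_kung_scan(const float *X, float *Y, int n) {
    __shared__ float XY[2 * BK_BLOCK_SIZE];
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;
    XY[threadIdx.x + blockDim.x] = (i + blockDim.x < n) ? X[i + blockDim.x] : 0.0f;
    // Reduction phase: traverse down the balanced tree, building partial sums
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        unsigned int index = (threadIdx.x + 1) * 2 * stride - 1;
        if (index < 2 * blockDim.x)
            XY[index] += XY[index - stride];
    }
    // Post-reduction phase: traverse back up, distributing partial sums
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        unsigned int index = (threadIdx.x + 1) * 2 * stride - 1;
        if (index + stride < 2 * blockDim.x)
            XY[index + stride] += XY[index];
    }
    __syncthreads();
    if (i < n) Y[i] = XY[threadIdx.x];
    if (i + blockDim.x < n) Y[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}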
KOGGE-STONE VS. BRENT KUNG
Brent-Kung uses half the number of threads compared to Kogge-Stone
Each thread should load two elements into shared memory
However... Brent-Kung takes twice the number of iterations/steps
Kogge-Stone is more popular for parallel scan in industry
A COMPLETE HIERARCHICAL SCAN
• If the input data is too
large, we’ll need to
compute partial scans of
each block.
• We repeat this process until
data can finally be scanned
by one block.
• We then perform vector
addition(s) to get our final
array of scanned values!
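A minimal host-side sketch of this hierarchical pattern, assuming hypothetical kernels scan_block() (scan one section in place and record its total) and add_block_sums() (add each section's preceding total back in), and a section size equal to the block size.

#include <cuda_runtime.h>

#define SCAN_SECTION 1024                    // illustrative: elements one block can scan

// Hypothetical kernels (assumed, not from the slides)
__global__ void scan_block(float *data, float *block_sums, int n);
__global__ void add_block_sums(float *data, const float *block_sums, int n);

void hierarchical_scan(float *d_data, int n) {
    int numBlocks = (n + SCAN_SECTION - 1) / SCAN_SECTION;
    float *d_block_sums;
    cudaMalloc(&d_block_sums, numBlocks * sizeof(float));
    // 1. Partial scan of each section; each block writes its section total
    scan_block<<<numBlocks, SCAN_SECTION>>>(d_data, d_block_sums, n);
    if (numBlocks > 1) {
        // 2. Scan the block sums, recursing until they fit in a single block
        hierarchical_scan(d_block_sums, numBlocks);
        // 3. Vector addition: fold each section's preceding total into its elements
        add_block_sums<<<numBlocks, SCAN_SECTION>>>(d_data, d_block_sums, n);
    }
    cudaFree(d_block_sums);
}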
PRACTICE PROBLEM
Suppose we need to perform an inclusive scan on an input of 2^42 elements. A
customized GPU supports at most 2^9 blocks per grid and at most 2^12 threads per block.
Using Brent-Kung in a hierarchical fashion, with none of the scan work done by the host,
how many times do we need to launch the Brent-Kung kernel?
PRACTICE PROBLEM
For the Kogge-Stone scan kernel based on reduction trees, assume that we have 1024
elements in each section and warp size is 32, how many warps in each block will have
control divergence during the iteration where stride is 16?
How many warps in each block will have control divergence during the iteration where
stride is 64?
HISTOGRAMS
LET’S LOOK AT DATA RACE CONDITION FIRST..
What if two threads are attempting to write
into or modify the same memory address?
Our kernel cannot function correctly. It might
work sometimes, but we’re creating a scenario
of luck, which is not what we want.
So.. What’s our solution?
ATOMICS
Hardware ensures that no other thread can access a certain memory location
until the atomic operation is complete.
What happens to the other threads that are trying to access?
• Held in queue
All atomic operations are executed serially! But… that surely can’t be efficient..
Note that atomicity does not constrain relative order: whichever thread reaches the
memory address first gets to go first.
WHAT IS HISTOGRAMMING?
A method for extracting notable features and patterns from large datasets
• Feature extraction for object recognition in images
• Fraud detection in credit card transactions
• Correlating heavenly object movements in astrophysics
Basic histograms - for each element in the data set, use the value to identify a
“bin” to increment
A SIMPLE APPROACH…
Threads 1 and 2 have contention in the first iteration - both are updating “M”;
atomicAdd ensures a correct update.
Can we make this better..?
A MORE COALESCED APPROACH
We want to assign inputs to threads in a strided pattern.
Adjacent threads process adjacent input letters! This way we can better utilize
DRAM bursts.
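A minimal sketch of this coalesced (strided) approach using global atomics; the 26-bin lowercase-letter histogram and the array names are illustrative assumptions.

__global__ void histo_kernel(const char *data, unsigned int length, unsigned int *histo) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    // Adjacent threads read adjacent input characters each iteration,
    // so a warp's loads fall into the same DRAM burst
    for (; i < length; i += stride) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo[pos], 1);       // hardware serializes conflicting updates
    }
}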
DISCUSSING LATENCY AND THROUGHPUT ON ATOMICS
Atomics perform a read-modify-write sequence of operations on a given memory location.
Throughput of an atomic operation is the rate at which the application can execute an atomic
operation on a particular location.
The rate is limited by the total latency of the read-modify-write sequence, typically more than
1000 cycles for global memory (DRAM) locations.
This means that if many threads attempt to do atomic operation on the same location
(contention), the memory bandwidth is reduced to < 1/1000!
L2 cache is faster than global memory (DRAM)
Shared memory is even faster
It also allows privatized per-block histogram copies, which greatly reduce contention
But the private copies must be merged into global memory at the end
Any other overheads with privatization?
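A minimal sketch of such a privatized histogram kernel; the initialization loop and the final merge loop are exactly the extra overheads the question above alludes to. NUM_BINS and the letter binning are carried over from the earlier sketch as assumptions.

#define NUM_BINS 26                          // illustrative: one bin per lowercase letter

__global__ void histo_private_kernel(const char *data, unsigned int length, unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    // Initialize the block's private copy (one privatization overhead)
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        histo_s[bin] = 0;
    __syncthreads();
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    for (; i < length; i += stride) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < NUM_BINS)
            atomicAdd(&histo_s[pos], 1);     // fast shared-memory atomics, per-block contention only
    }
    __syncthreads();
    // Merge the private copy into the global histogram (the other overhead)
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        atomicAdd(&histo[bin], histo_s[bin]);
}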
PRACTICE PROBLEM
For a processor that supports atomic operations in L2 cache, assume that each atomic
operation takes 4ns to complete in L2 cache and 100ns to complete in DRAM. Assume that
90% of the atomic operations hit in L2 cache. What is the approximate throughput for
atomic operations on the same global memory variable?
SPARSE MATRICES
WHAT ARE SPARSE MATRICES?
Sparse matrices are matrices that have mostly 0s. These matrices are typically
used in scientific computing, data science, and network analysis.
Compared to dense matrix multiplication, SpMVs are irregular and unstructured..
• Have little input data reuse
If we expressed sparse matrices as regular matrices, we would waste a lot of
space
• Limited scalability
SPARSE MATRIX VECTOR MULTIPLICATION
We are trying to
multiply the matrix A by
a vector X.
Sparse matrices are
missing some elements
in a row.
We need the column index of
each stored value to know which
element of X to multiply it with.
COMPRESSED SPARSE ROW FORMAT (CSR)
The standard sparse matrix representation:
Data array of the non-zero values
Column indices
“Compressed” row pointers
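A minimal sketch of SpMV over CSR with one thread per row; the array names (data, col_index, row_ptr) follow the usual convention and are assumptions here.

__global__ void spmv_csr(int num_rows, const float *data, const int *col_index,
                         const int *row_ptr, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        // Row `row` owns the non-zeros in [row_ptr[row], row_ptr[row + 1])
        for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++)
            dot += data[i] * x[col_index[i]];
        y[row] = dot;
    }
}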
ISSUES REGARDING SPMV WITH CSR
Right off the bat, what are some issues we see..?
Control divergence?
Threads execute different numbers of iterations
Rows have different numbers of non-zero values
Uncoalesced access..
Each thread doesn’t access adjacent memory locations on each iteration
ELL/PACK FORMAT TRIES TO FIX THIS…
What if we tried to pad the rows
and transpose it? What would this
look like?
Zero-pad every row to the length of the
longest row in the sparse matrix
Transpose the zero-padded matrix
for DRAM efficiency
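A minimal sketch of SpMV over ELL, assuming data and col_index are zero-padded to max_nnz_per_row entries per row and stored transposed (column-major) so that consecutive threads read consecutive addresses; names are illustrative.

__global__ void spmv_ell(int num_rows, int max_nnz_per_row, const float *data,
                         const int *col_index, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int t = 0; t < max_nnz_per_row; t++) {
            int idx = t * num_rows + row;         // transposed (column-major) indexing
            dot += data[idx] * x[col_index[idx]]; // zero-padded entries contribute 0
        }
        y[row] = dot;
    }
}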
COO FORMAT – GOOD ‘OLE TRUSTY
List column and row indexes for every non-zero value in the sparse matrix! This
allows for easy re-ordering.
CAN WE COMBINE ELL AND COO?
What if we had a matrix that had mostly 1~2 elements per row, but some rows
randomly had a lot more non-zero elements than majority?
ELL format for the majority of the rows
COO format for the rows that are outliers
PRACTICE PROBLEM
Convert the following matrix into CSR representation and COO representation:
JAGGED DIAGONAL SPARSE (JDS)
What if we took the CSR format but optimized it
further? What are some problems with the CSR format?
The number of elements per row varies unpredictably
SpMV can cause a large amount of control divergence
if the number of iterations per thread varies a lot
…which means block performance is determined by
the longest row in the block itself
• Uncoalesced access pattern of input data
Solution: Sort the matrix by the number of elements
in a row in descending order. We can also transpose
the matrix to access elements in a coalesced manner.
JDS REGULARIZATION
CSR TO JDS CONVERSION
Row pointers must change,
and the newly arranged rows
must keep pointers back to the
original rows they came from.
jds_row_ptr: similar to the
original CSR row_ptr
jds_row_perm: maps each JDS
row back to its original row
index
BUT WHAT ABOUT MEMORY COALESCING?
With the rows sorted in
descending order, threads now
handle similar numbers of
non-zero values.
What about an efficient data
access pattern? Solution:
transpose our matrix
Thread 0 takes: 2, 4, 1
Thread 1 takes: 3, 1
Thread 3 takes: 1, 1
This is very similar to lab 2 & 3!
PRACTICE PROBLEM
What type of sparse matrix format should we use if our data distribution is….
Roughly random?
High variance in rows? (some really long and some really short)
Super sparse?
Roughly triangular?
PROJECT RECAP
CONVOLUTION RECAP
INPUT MATRIX UNROLLING
The central idea is unfolding and replicating
the inputs to the convolutional kernel such
that all elements needed to compute one
output element will be stored as one
sequential column.
Note that the threads would traverse from
top to bottom, thus accessing the data in a
coalesced pattern.
Example on the right:
1 full image
3 channels
2x2 mask
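A minimal sketch of such an unrolling ("im2col") kernel for one image, assuming a valid (no-padding) convolution; the function and parameter names are illustrative rather than the exact project code.

__global__ void unroll_kernel(const float *X, float *X_unroll,
                              int C, int H, int W, int K) {
    int H_out = H - K + 1;
    int W_out = W - K + 1;
    int W_unroll = H_out * W_out;            // number of output elements = columns of X_unroll
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < C * W_unroll) {
        int c = t / W_unroll;                // which input channel this thread copies from
        int col = t % W_unroll;              // which output element (column) this thread fills
        int h_out = col / W_out;
        int w_out = col % W_out;
        int row_base = c * K * K;            // starting row of this channel's section
        // Adjacent threads fill adjacent columns of each unrolled row, so stores are coalesced
        for (int p = 0; p < K; p++)
            for (int q = 0; q < K; q++) {
                int row = row_base + p * K + q;
                X_unroll[row * W_unroll + col] =
                    X[c * H * W + (h_out + p) * W + (w_out + q)];
            }
    }
}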
STREAMS
Stream: a sequence of operations that execute in order. Streams all individually
do work and can be utilized to speed up overall kernel execution when work can
be divided up independently.
[Timeline diagram: with a single stream, H2D copies and kernel launches (K<<<>>>) execute back-to-back; with Stream 1 and Stream 2, copies and kernels from different streams overlap in time.]
Note that streams are performing
asynchronous memory transfers. But memory
transfers are expensive and have a lot of
overhead. How do we solve this issue?
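As a concrete reference for the stream pattern itself (the transfer-overhead question is addressed on the next slides), here is a minimal sketch of dividing work across two streams so copies and kernels from different segments can overlap; myKernel, SEG_SIZE, and the buffer names are illustrative, and the host buffers are assumed to be pinned.

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int len);   // hypothetical kernel

void launch_with_streams(float *h_in, float *h_out, float *d_in, float *d_out, int n) {
    const int SEG_SIZE = 1 << 20;            // elements per segment (assumption)
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);
    for (int i = 0; i < n; i += SEG_SIZE) {
        int s = (i / SEG_SIZE) % 2;          // alternate segments between the two streams
        int len = (n - i < SEG_SIZE) ? (n - i) : SEG_SIZE;
        // Everything below is asynchronous: work in different streams may overlap,
        // while work within one stream executes in order
        cudaMemcpyAsync(d_in + i, h_in + i, len * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(len + 255) / 256, 256, 0, stream[s]>>>(d_in + i, d_out + i, len);
        cudaMemcpyAsync(h_out + i, d_out + i, len * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int s = 0; s < 2; s++) {
        cudaStreamSynchronize(stream[s]);    // wait for everything queued in this stream
        cudaStreamDestroy(stream[s]);
    }
}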
LET’S TALK ABOUT DIRECT MEMORY ACCESS
CUDA GPUs typically use Direct Memory Access (DMA) for memory transfers
between host and device. But what is DMA?
DMA utilizes the full bandwidth of an I/O bus
Uses physical address for source and destination
Allows memory transfers without involving CPU
Direct transfer of data between main memory and the GPU
Requires pinned memory… but why?
When you use CUDA APIs such as cudaMemcpy to transfer data between the
host and device, CUDA internally uses DMA engines to perform these transfers.
These DMA engines are specialized hardware units within the GPU that are
optimized for moving data between different memory spaces (e.g., host memory
and device memory) efficiently.
PAGEABLE VS PINNED MEMORY
When data is in pageable memory, it may be resident in physical memory, or it may be swapped out to a secondary
store (i.e., disk/SSD). Let’s walk through a scenario..
Scenario: CUDA wants to perform a memory copy from host to device
The CUDA runtime library detects that the required data is not in physical memory (DRAM), which triggers a page fault.
Handling a page fault requires the CPU to intervene, bringing the data from disk into physical memory (page-in).
First overhead
Disk I/O operations take up a lot of time!
Once the data is in physical memory, CUDA asks the CPU to copy it into pinned memory. Data in pinned memory cannot be
paged out to disk until it is officially unpinned.
Second overhead
If the buffer were not pinned, the OS could page out data that is being read or written by a DMA.
After pinning, the memory transfer from host to device can finally happen. CUDA then asks the host to release the pinned memory.
Minimizing these overheads is crucial in memory management. For asynchronous memory transfers using
streams, the required data must be pinned. Data transfers from already-pinned memory run around 2x faster.
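A minimal sketch of allocating pinned host memory up front, so that cudaMemcpyAsync can DMA directly without the page-in and staging-copy overheads; N is an illustrative size.

#include <cuda_runtime.h>

void pinned_buffer_example() {
    const size_t N = 1 << 20;                    // illustrative element count
    float *h_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));   // pinned (page-locked) allocation instead of pageable malloc()
    // ... fill h_buf and issue cudaMemcpyAsync(..., stream) calls as in the stream sketch ...
    cudaFreeHost(h_buf);                         // unpin and release when done
}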
GPU ARCHITECTURE
WHAT IS PCI?
Peripheral Component Interconnect (PCI)
• Originally 33 MHz, 32-bit wide, later doubled
• Upstream bandwidth was still slow (~256 MB/s peak), and
the GPU had to arbitrate for the bus in the original
GPU-in-PC architecture
BUT it was cool due to its use of memory-mapped I/O.
• Addresses are assigned to PCI devices at boot time, and
the devices listen for their respective address
• Regular loads and stores can be used to communicate
with PCI devices
PCIe CONFIG
Each generation of PCIe aims to double the bandwidth and improve power efficiency.
The original PCIe ran at 2.5 GT/s per lane; the latest PCIe 6.0 runs at 64 GT/s.
Within a generation, the number of lanes in a link can also be scaled to provide more
distinct physical channels
• x1, x2, x4, x8, x16, x32, ….
SCALABLE COMPUTING
SCALABLE COMPUTING AND DATACENTERS
Vikram’s lecture but super fast and a lot less detailed
• Number of operations is outpacing compute scaling
• AI models are continuing to grow to more and more parameters to avoid
overtraining, and they’re already good at what they do
• Power, latency, data scarcity, chip production are factors
• Data parallelism, pipeline parallelism, and model parallelism help break down
these massive models to sizable units sent to different GPUs during training
• Inference has its own problems that harm performance
TENSOR OPERATIONS
TENSOR OPERATIONS
A lot of parallel programming applications and AI models are just many matrix
multiplications one after another.
Matrix multiplication and accumulate is our limiting factor; if we speed it up,
we get huge gains, even from an improvement of only a fraction of a percent.
If we must matrix multiply so much, why not just have specialized hardware for the
role?
At the thread level, for tiled matrix multiplication, you must loop through the tile
and calculate a partial dot product. But if the tile width is fixed, the hardware can
specialize to perform this task.
TENSOR CORES
For a 4x4 tile example, this is what the naïve hardware would look like:
But you already know the register access pattern as well; what if we significantly
simplify the loads from shared memory to the register file?
TENSOR CORES OPTIMIZED
Each thread only loads 2 values from shared memory to the register file, and then
the hardware’s routing network routes the values to computational units.
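For context, a minimal sketch of how CUDA exposes Tensor Cores through the WMMA API: one warp cooperatively loads 16x16 tiles into fragments and issues a single matrix-multiply-accumulate. The 16x16x16 shape, row-major layouts, and leading dimensions are illustrative assumptions, not the exact configuration from lecture.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B with FP16 inputs and FP32 accumulation
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);                 // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);             // 16 = leading dimension of each tile
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // one Tensor Core matrix-multiply-accumulate per warp
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}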