GPU - Final - Gradescope


Q1
2 Points

Which of the following memory types has the slowest access speed?
L2 Cache
Shared Memory
Register
L1 Cache
Global Memory

Q2
4 Points

Assume that a kernel is launched with 16 thread blocks, each with 512 threads.
If a variable is declared as a shared memory variable, how many versions of the variable will be created throughout the lifetime of the execution of the kernel?

16
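One copy of a __shared__ variable is created per thread block, so 16 blocks yield 16 versions. A minimal sketch (the kernel name and array use are illustrative, not from the exam):

__global__ void sharedDemo(float* data) {
    // One instance of this array exists per thread block:
    // 16 blocks -> 16 copies, each shared by that block's 512 threads.
    __shared__ float tile[512];

    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();  // all 512 threads of the block see the same tile[]
    data[blockIdx.x * blockDim.x + t] = tile[t] * 2.0f;
}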

Q3
4 Points

For the below code snippet, which two instructions would experience overlapped execution? (The code snippet was not captured in this extract.)


Q4
2 Points

In order to launch concurrently running kernels on multiple GPUs, programmers are required to use CUDA streams.
False
True


Q5
12 Points

Comprehensive Exam Problem - Matrix Multiply

Q5.1
2 Points

For our tiled matrix multiplication kernel, if we use a 16x16 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
~1/32 of the original usage
~1/8 of the original usage
~1/16 of the original usage
~1/64 of the original usage
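With a 16x16 tile, each element of M and N brought into shared memory is reused by 16 threads of the block, so global-memory traffic drops to roughly 1/16 of the untiled kernel. A minimal sketch of the standard tiled kernel (TILE_WIDTH and variable names follow the common textbook pattern; width is assumed divisible by TILE_WIDTH):

#define TILE_WIDTH 16

__global__ void tiledMatMul(float* M, float* N, float* P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int ph = 0; ph < width / TILE_WIDTH; ph++) {
        // Each thread loads one M element and one N element; every
        // loaded element is then read by 16 threads of the block.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; k++)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }
    P[row * width + col] = acc;
}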

Q5.2
4 Points

For a tiled single-precision (32-bit) matrix multiplication kernel, assume that each thread block is 32x32 and the system has a DRAM burst size of 128 bytes.
How many DRAM bursts will be delivered to the processor as a result of loading an M-matrix tile by a thread block (during one phase)?
Keep in mind that each single-precision floating-point number is four bytes.

32
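Worked out: one row of a 32x32 tile is 32 floats x 4 B = 128 B, exactly one DRAM burst, so loading all 32 rows of the tile takes 32 bursts.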

Q5.3
3 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are transferred from device to host during the lifetime of execution?


Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80000
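Worked out: only the result matrix P is copied device-to-host, so 100 x 100 elements x 8 B per double = 80,000 bytes.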

Q5.4
3 Points

Assume a tiled matrix multiplication that handles boundary conditions. Assume that we use 32x32 tiles to process rectangular matrices of (1,000 x 2,000) and (2,000 x 2,000).
How many thread blocks are launched?
How many thread blocks are launched?

(2000/32)*(2000/32)
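Note: the grade report below marks this answer 0/3. Blocks are launched over the 1,000 x 2,000 output matrix P, and boundary handling requires rounding up, so the expected count is presumably ceil(1000/32) x ceil(2000/32) = 32 x 63 = 2016 blocks.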

Q6
2 Points

Variables stored in registers are visible to:

All warps in a thread block

All threads in a kernel

A single thread only

All threads in a thread block

Q7
3 Points

For the code snippet below, which shows a basic histogram kernel, how many bins does the histogram histo[] have? (The code snippet was not captured in this extract.)


Q8
2 Points

Which of the following is true?

GPUs do not use program counters to access instructions

All threads in a warp share the same program counter

All warps in an SM share the same program counter

Instructions in a warp are processed out-of-order

Warps consist of multiple thread blocks

Q9
2 Points

If your CUDA application has a single stream, can you concurrently copy data between CPU/GPU and execute a kernel at the same time?
Yes
No

Q10
2 Points

For the code snippet below, which shows a basic histogram kernel, is the access to buffer[] coalesced or not coalesced? (The code snippet was not captured in this extract.)

Not Coalesced
Coalesced

Q11
5 Points

What are some factors that are pushing emerging GPU designs to adopt Multi-Chip Modules? And what are the implications of emerging MCM designs?

(Please keep your answer concise.)

In general, an MCM is a microcircuit in which silicon occupies more than 30% of the substrate area.
MCM designs optimize the GPU by slicing one large monolithic die into several smaller chips.
Multi-chip module (MCM) graphics cards are widely viewed as the future because they have a large impact on performance.

Q12
2 Points

Which of the following statements is true?

If pageable data is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.

Data transfer between a CUDA device and the host is done by DMA hardware using virtual addresses.

Pinned memory is allocated with the cudaMalloc() function.

The OS always guarantees that any memory being used by DMA hardware is not swapped out.

Q13
2 Points

Each CUDA Stream is a [___] of operations.

Command
Queue
Heap
Stack
Event

Q14
3 Points

As data is transferred from host to device using cudaMemcpy, it touches various types of hardware components (memories, buses, on-chip components, etc.). Order the following hardware components in the order in which they are used during a cudaMemcpy H2D operation.

Host Memory -> DMA Engine -> Pinned Memory -> PCIe Bus -> Global Memory
Pinned Memory -> Host Memory -> PCIe Bus -> DMA Engine -> Global Memory
Pinned Memory -> Host Memory -> DMA Engine -> PCIe Bus -> Global Memory
Host Memory -> Pinned Memory -> DMA Engine -> PCIe Bus -> Global Memory
Host Memory -> Pinned Memory -> PCIe Bus -> DMA Engine -> Global Memory

Q15
2 Points

Unified memory does not support prefetching of pages between CPU/GPU.
False
True

https://www.gradescope.com/courses/461455/assignments/2467151/submissions/151565671 8/20
9/19/23, 11:15 PM View Submission | Gradescope

Q16
3 Points

When allocating data with cudaMallocManaged, where are the memory pages physically allocated? (Assume Unified Memory behavior on Pascal architectures and onward.)

CPU
Nowhere
GPU

Q17
2 Points

cudaMallocManaged allocates memory to:


Global Memory
Host Memory
Pinned Memory
Managed Memory
Unified Memory
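On Pascal and later GPUs, cudaMallocManaged reserves unified (managed) memory but defers physical allocation until first touch, with pages migrating on demand between CPU and GPU; cudaMemPrefetchAsync moves them explicitly. A minimal sketch (size and names are illustrative):

int *data;
size_t n = 1 << 20;

// Managed/unified allocation: no physical pages yet on Pascal+.
cudaMallocManaged(&data, n * sizeof(int));

// First touch on the CPU faults pages into host memory...
for (size_t i = 0; i < n; i++) data[i] = 0;

// ...and an optional explicit prefetch migrates them to the GPU
// (device 0 assumed) before a kernel runs, avoiding GPU page faults.
cudaMemPrefetchAsync(data, n * sizeof(int), 0);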

Q18
3 Points

For the following code snippet, how many page faults would occur? (The code snippet was not captured in this extract.)

3
2
1

Q19
2 Points

In the Nvidia Collective Communication Library (NCCL), to support multi-GPU applications, the GPUs communicate over a [____]-based protocol.

Bus

Tree

Ring

Star
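For context: NCCL's classic collectives arrange the GPUs in a logical ring, with each GPU exchanging buffer chunks only with its neighbors; in a ring all-reduce over N GPUs, each GPU transfers roughly 2(N-1)/N of the buffer, which keeps every link near full utilization.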

Q20
2 Points

Which of the following memory types has the fastest access speed?
Shared Memory
L2 Cache
Global Memory
Register
L1 Cache

Q21
12 Points

Comprehensive Exam Problem - Histogram

Q21.1
4 Points

Consider a Histogram kernel (with 16 thread blocks of 512 threads each) which uses atomicAdd() to update a histogram with 10 bins, histo[10], in global memory.

Now let's consider the case where we privatize this kernel so that every thread block has a private histogram in shared memory, histo_private[10]: each thread block first updates this private histogram locally, then updates the global histogram when the thread block completes.

If this privatized Histogram kernel is operating on an array of 10,000 elements, how many atomic operations will be sent to the global memory?

160
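Worked out: with privatization, global atomics occur only when each block flushes its 10-bin private histogram, so 16 blocks x 10 bins = 160 global atomic operations; the 10,000 per-element updates hit shared memory instead.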

Q21.2
4 Points

Consider a basic Histogram kernel (with 16 thread blocks of 512 threads each) that uses atomicAdd() to directly update a histogram with 10 bins, histo[10], in global memory. If the kernel is operating on an array of 4,096 elements, how many atomic operations will be sent to the global memory?

16
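Note: the grade report below marks this answer 0/4. A basic (non-privatized) kernel issues one global atomicAdd per input element, so the expected count is presumably 4,096.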

Q21.3
4 Points

The following code shows a histogram kernel with 2048 bins.

#define num_bins 2048  // per the problem statement ("2048 bins")

// Note: the kernel's parameter list was truncated in this capture; the
// trailing `size` parameter is assumed from the loop bounds below.
__global__ void histogram(unsigned int* input, unsigned int* histo, unsigned int size) {

    __shared__ unsigned int private_histo[num_bins];

    int j = threadIdx.x;

    while (j < num_bins) {
        __syncthreads();                          // <---- syncthreads 1
        private_histo[j] = 0;
        j++;
    }

    __syncthreads();                              // <---- syncthreads 2

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        __syncthreads();                          // <---- syncthreads 3
        atomicAdd(&(private_histo[input[i]]), 1);
        i += stride;
    }

    __syncthreads();                              // <---- syncthreads 4

    j = threadIdx.x;

    while (j < num_bins) {
        __syncthreads();                          // <---- syncthreads 5
        atomicAdd(&(histo[j]), private_histo[j]);
        j += blockDim.x;
    }
}

Which __syncthreads() is not required for correct functionality of the histogram kernel?
syncthreads 5

syncthreads 2

syncthreads 3

syncthreads 1

syncthreads 4

Q22
4 Points

For the following code snippet, which instructions would trigger a page fault? (Assume Unified Memory behavior on pre-Pascal architectures. The code snippet was not captured in this extract.)

Q23
4 Points

For the following code snippet, which instructions would trigger a page fault? (Assume Unified Memory behavior on Pascal architectures and onward. The code snippet was not captured in this extract.)

Q24
3 Points

For the below code snippet, which instruction would block the queues from enabling overlap of data transfer and computation? (The code snippet was not captured in this extract; only the answer choices A-H remain.)

F
D
C
A
B
E
H
G

Q25
5 Points

What are some major performance considerations or design challenges when programming multi-GPU applications?

(Please keep your answer concise.)

The end of Moore's Law, as transistor scaling has slowed down.

Photolithography limitations.

Q26
2 Points

Unified memory cannot suffer from page thrashing because the CPU and GPU share the same memory space. (Thrashing means a page consistently migrates back and forth between the CPU and GPU during the operation of an application.)
False
True

Q27
3 Points

For a vector add program, how many CUDA Streams would be necessary to enable ideal pipelining such that there is maximum overlap between H2D, D2H data transfer, and kernel computation?
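The submitted answer was not captured here. For context, a minimal sketch of the usual chunked, multi-stream vector-add pipeline (NUM_STREAMS, vecAdd, and the buffer names are illustrative; h_a/h_b/h_c must be pinned via cudaHostAlloc for cudaMemcpyAsync to overlap, d_a/d_b/d_c are device buffers, and n is assumed divisible by NUM_STREAMS):

#define NUM_STREAMS 4

cudaStream_t streams[NUM_STREAMS];
for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamCreate(&streams[s]);

int chunk = n / NUM_STREAMS;
for (int s = 0; s < NUM_STREAMS; s++) {
    int off = s * chunk;
    // Chunk s's H2D copies, kernel, and D2H copy all queue in stream s,
    // so chunk s+1's copies can overlap chunk s's kernel execution.
    cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(d_b + off, h_b + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    vecAdd<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
        d_a + off, d_b + off, d_c + off, chunk);
    cudaMemcpyAsync(h_c + off, d_c + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();

for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamDestroy(streams[s]);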

Q28
2 Points

Commands (aka Events) in the same CUDA stream can be processed out-of-order.
True
False

Q29
2 Points

cudaHostAlloc allocates memory in:


Managed Memory
Pinned Memory
Unified Memory
Global Memory
Host Memory

Q30
2 Points

cudaMalloc allocates memory in:


Unified Memory
Managed Memory
Host Memory
Pinned Memory
Global Memory
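To summarize the allocation APIs touched on in Q17, Q29, and Q30, a minimal sketch (size is illustrative):

float *d_ptr, *h_pinned, *managed;
size_t bytes = 1024 * sizeof(float);

cudaMalloc(&d_ptr, bytes);            // device global memory
cudaHostAlloc(&h_pinned, bytes,       // pinned (page-locked) host memory
              cudaHostAllocDefault);
cudaMallocManaged(&managed, bytes);   // unified/managed memory, visible to CPU and GPU

cudaFree(d_ptr);
cudaFreeHost(h_pinned);
cudaFree(managed);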

Final Exam (Graded)

Student: Sree Charan Reddy Gangireddy
Total Points: 77 / 100 pts

Question 1: 2 / 2 pts
Question 2: 4 / 4 pts
Question 3: 4 / 4 pts
Question 4: 2 / 2 pts
Question 5: 9 / 12 pts (5.1: 2/2, 5.2: 4/4, 5.3: 3/3, 5.4: 0/3)
Question 6: 2 / 2 pts
Question 7: 3 / 3 pts
Question 8: 0 / 2 pts
Question 9: 2 / 2 pts
Question 10: 2 / 2 pts
Question 11: 3 / 5 pts
Question 12: 2 / 2 pts
Question 13: 2 / 2 pts
Question 14: 0 / 3 pts
Question 15: 2 / 2 pts
Question 16: 3 / 3 pts
Question 17: 2 / 2 pts
Question 18: 3 / 3 pts
Question 19: 2 / 2 pts
Question 20: 2 / 2 pts
Question 21: 8 / 12 pts (21.1: 4/4, 21.2: 0/4, 21.3: 4/4)
Question 22: 2 / 4 pts
Question 23: 4 / 4 pts
Question 24: 3 / 3 pts
Question 25: 0 / 5 pts
Question 26: 2 / 2 pts
Question 27: 3 / 3 pts
Question 28: 2 / 2 pts
Question 29: 0 / 2 pts
Question 30: 2 / 2 pts
