GPU - Final - Gradescope


Q1
2 Points

Which of the following memory types has the slowest access speed?
L2 Cache
Shared Memory
Register
L1 Cache
Global Memory

Q2
4 Points

Assume that a kernel is launched with 16 thread blocks, each with 512 threads.
If a variable is declared as a shared memory variable, how many versions of the variable will be created throughout the lifetime of the execution of the kernel?

16
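One copy of a __shared__ variable is created per thread block, so 16 blocks yield 16 versions. A minimal sketch (the kernel name and array use are illustrative, not from the exam):

__global__ void sharedDemo(float* data) {
    // One instance of this array exists per thread block:
    // 16 blocks -> 16 copies, each shared by that block's 512 threads.
    __shared__ float tile[512];

    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();  // all 512 threads of the block see the same tile[]
    data[blockIdx.x * blockDim.x + t] = tile[t] * 2.0f;
}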

Q3
4 Points

For the below code snippet, which two instructions would experience overlapped execution? (The code snippet was not captured in this extract.)


Q4
2 Points

In order to launch concurrently running kernels on multiple GPUs, programmers are required to use CUDA streams.
False
True


Q5
12 Points

Comprehensive Exam Problem - Matrix Multiply

Q5.1
2 Points

For our tiled matrix multiplication kernel, if we use a 16x16 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
~1/32 of the original usage
~1/8 of the original usage
~1/16 of the original usage
~1/64 of the original usage
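With a 16x16 tile, each element of M and N brought into shared memory is reused by 16 threads of the block, so global-memory traffic drops to roughly 1/16 of the untiled kernel. A minimal sketch of the standard tiled kernel (TILE_WIDTH and variable names follow the common textbook pattern; width is assumed divisible by TILE_WIDTH):

#define TILE_WIDTH 16

__global__ void tiledMatMul(float* M, float* N, float* P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int ph = 0; ph < width / TILE_WIDTH; ph++) {
        // Each thread loads one M element and one N element; every
        // loaded element is then read by 16 threads of the block.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; k++)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }
    P[row * width + col] = acc;
}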

Q5.2
4 Points

For a tiled single-precision (32-bit) matrix multiplication kernel, assume that each thread block is 32x32 and the system has a DRAM burst size of 128 bytes.
How many DRAM bursts will be delivered to the processor as a result of loading an M-matrix tile by a thread block (during one phase)?
Keep in mind that each single-precision floating-point number is four bytes.

32
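Worked out: one row of a 32x32 tile is 32 floats x 4 B = 128 B, exactly one DRAM burst, so loading all 32 rows of the tile takes 32 bursts.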

Q5.3
3 Points

In the basic Matrix Multiply code, if matrices M, N, and P are of size 100x100, how many total bytes of data are transferred from device to host during the lifetime of execution?


Assume matrix multiply operates on double-precision (64-bit) floating point numbers.

80000
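Worked out: only the result matrix P is copied device-to-host, so 100 x 100 elements x 8 B per double = 80,000 bytes.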

Q5.4
3 Points

Assume a tiled matrix multiplication that handles boundary conditions. Assume that we use 32x32 tiles to process rectangular matrices of (1,000 x 2,000) and (2,000 x 2,000).
How many thread blocks are launched?
How many thread blocks are launched?

(2000/32)*(2000/32)
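Note: the grade report below marks this answer 0/3. Blocks are launched over the 1,000 x 2,000 output matrix P, and boundary handling requires rounding up, so the expected count is presumably ceil(1000/32) x ceil(2000/32) = 32 x 63 = 2016 blocks.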

Q6
2 Points

Variables stored in registers are visible to:

All warps in a thread block

All threads in a kernel

A single thread only

All threads in a thread block

Q7
3 Points

For the code snippet below, which shows a basic histogram kernel, how many bins does the histogram histo[] have? (The code snippet was not captured in this extract.)


Q8
2 Points

Which of the following is true?

GPUs do not use program counters to access instructions

All threads in a warp share the same program counter

All warps in an SM share the same program counter

Instructions in a warp are processed out-of-order

Warps consist of multiple thread blocks

Q9
2 Points

If your CUDA application has a single stream, can you concurrently copy data between CPU/GPU and execute a kernel at the same time?
Yes
No

Q10
2 Points

For the code snippet below, which shows a basic histogram kernel, is the access to buffer[] coalesced or not coalesced? (The code snippet was not captured in this extract.)

Not Coalesced
Coalesced

Q11
5 Points

What are some factors that are pushing emerging GPU designs to adopt Multi-Chip Modules? And what are the implications of emerging MCM designs?

(Please keep your answer concise.)

In general, an MCM is a microcircuit in which silicon occupies more than 30% of the substrate area.
MCM designs optimize the GPU by slicing one large monolithic die into several smaller chips.
Multi-chip module (MCM) graphics cards are widely viewed as the future because they have a large impact on performance.

Q12
2 Points

Which of the following statements is true?

If pageable data is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.

Data transfer between a CUDA device and the host is done by DMA hardware using virtual addresses.

Pinned memory is allocated with the cudaMalloc() function.

The OS always guarantees that any memory being used by DMA hardware is not swapped out.

Q13
2 Points

Each CUDA Stream is a [___] of operations.

Command
Queue
Heap
Stack
Event

Q14
3 Points

As data is transferred from host to device using cudaMemcpy, it touches various types of hardware components (memories, buses, on-chip components, etc.). Order the following hardware components in the order in which they are used during a cudaMemcpy H2D operation.

Host Memory -> DMA Engine -> Pinned Memory -> PCIe Bus -> Global Memory
Pinned Memory -> Host Memory -> PCIe Bus -> DMA Engine -> Global Memory
Pinned Memory -> Host Memory -> DMA Engine -> PCIe Bus -> Global Memory
Host Memory -> Pinned Memory -> DMA Engine -> PCIe Bus -> Global Memory
Host Memory -> Pinned Memory -> PCIe Bus -> DMA Engine -> Global Memory

Q15
2 Points

Unified memory does not support prefetching of pages between CPU/GPU.
False
True

https://www.gradescope.com/courses/461455/assignments/2467151/submissions/151565671 8/20
9/19/23, 11:15 PM View Submission | Gradescope

Q16
3 Points

When allocating data with cudaMallocManaged, where are the memory pages physically allocated? (Assume Unified Memory behavior on Pascal architectures and onward.)

CPU
Nowhere
GPU

Q17
2 Points

cudaMallocManaged allocates memory to:


Global Memory
Host Memory
Pinned Memory
Managed Memory
Unified Memory
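On Pascal and later GPUs, cudaMallocManaged reserves unified (managed) memory but defers physical allocation until first touch, with pages migrating on demand between CPU and GPU; cudaMemPrefetchAsync moves them explicitly. A minimal sketch (size and names are illustrative):

int *data;
size_t n = 1 << 20;

// Managed/unified allocation: no physical pages yet on Pascal+.
cudaMallocManaged(&data, n * sizeof(int));

// First touch on the CPU faults pages into host memory...
for (size_t i = 0; i < n; i++) data[i] = 0;

// ...and an optional explicit prefetch migrates them to the GPU
// (device 0 assumed) before a kernel runs, avoiding GPU page faults.
cudaMemPrefetchAsync(data, n * sizeof(int), 0);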

Q18
3 Points

For the following code snippet, how many page faults would occur? (The code snippet was not captured in this extract.)

3
2
1

Q19
2 Points

In the Nvidia Collective Communication Library (NCCL), to support multi-GPU applications, the GPUs communicate over a [____]-based protocol.

Bus

Tree

Ring

Star
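For context: NCCL's classic collectives arrange the GPUs in a logical ring, with each GPU exchanging buffer chunks only with its neighbors; in a ring all-reduce over N GPUs, each GPU transfers roughly 2(N-1)/N of the buffer, which keeps every link near full utilization.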

Q20
2 Points

Which of the following memory types has the fastest access speed?
Shared Memory
L2 Cache
Global Memory
Register
L1 Cache

Q21
12 Points

Comprehensive Exam Problem - Histogram

Q21.1
4 Points

Consider a Histogram kernel (with 16 thread blocks of 512 threads each) which uses atomicAdd() to update a histogram with 10 bins, histo[10], in global memory.

Now let's consider the case where we privatize this kernel so that every thread block has a private histogram in shared memory, histo_private[10]: each thread block first updates this private histogram locally, then updates the global histogram when the thread block completes.

If this privatized Histogram kernel is operating on an array of 10,000 elements, how many atomic operations will be sent to the global memory?

160
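Worked out: with privatization, global atomics occur only when each block flushes its 10-bin private histogram, so 16 blocks x 10 bins = 160 global atomic operations; the 10,000 per-element updates hit shared memory instead.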

Q21.2
4 Points

Consider a basic Histogram kernel (with 16 thread blocks of 512 threads each) that uses atomicAdd() to directly update a histogram with 10 bins, histo[10], in global memory. If the kernel is operating on an array of 4,096 elements, how many atomic operations will be sent to the global memory?

16
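Note: the grade report below marks this answer 0/4. A basic (non-privatized) kernel issues one global atomicAdd per input element, so the expected count is presumably 4,096.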

Q21.3
4 Points

The following code shows a histogram kernel with 2048 bins.

#define num_bins 2048  // per the problem statement ("2048 bins")

// Note: the kernel's parameter list was truncated in this capture; the
// trailing `size` parameter is assumed from the loop bounds below.
__global__ void histogram(unsigned int* input, unsigned int* histo, unsigned int size) {

    __shared__ unsigned int private_histo[num_bins];

    int j = threadIdx.x;

    while (j < num_bins) {
        __syncthreads();                          // <---- syncthreads 1
        private_histo[j] = 0;
        j++;
    }

    __syncthreads();                              // <---- syncthreads 2

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        __syncthreads();                          // <---- syncthreads 3
        atomicAdd(&(private_histo[input[i]]), 1);
        i += stride;
    }

    __syncthreads();                              // <---- syncthreads 4

    j = threadIdx.x;

    while (j < num_bins) {
        __syncthreads();                          // <---- syncthreads 5
        atomicAdd(&(histo[j]), private_histo[j]);
        j += blockDim.x;
    }
}

Which __syncthreads() is not required for correct functionality of the histogram kernel?
syncthreads 5

syncthreads 2

syncthreads 3

syncthreads 1

syncthreads 4

Q22
4 Points

For the following code snippet, which instructions would trigger a page fault? (Assume Unified Memory behavior on pre-Pascal architectures. The code snippet was not captured in this extract.)

Q23
4 Points

For the following code snippet, which instructions would trigger a page fault? (Assume Unified Memory behavior on Pascal architectures and onward. The code snippet was not captured in this extract.)

Q24
3 Points

For the below code snippet, which instruction would block the queues from enabling overlap of data transfer and computation? (The code snippet was not captured in this extract; only the answer choices A-H remain.)

F
D
C
A
B
E
H
G

Q25
5 Points

What are some major performance considerations or design challenges when programming multi-GPU applications?

(Please keep your answer concise.)

The end of Moore's Law, as transistor scaling has slowed down.

Photolithography limitations.

Q26
2 Points

Unified memory cannot suffer from page thrashing because the CPU and GPU share the same memory space. (Thrashing means a page consistently migrates back and forth between the CPU and GPU during the operation of an application.)
False
True

Q27
3 Points

For a vector add program, how many CUDA Streams would be necessary to enable ideal pipelining such that there is maximum overlap between H2D, D2H data transfer, and kernel computation?
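The submitted answer was not captured here. For context, a minimal sketch of the usual chunked, multi-stream vector-add pipeline (NUM_STREAMS, vecAdd, and the buffer names are illustrative; h_a/h_b/h_c must be pinned via cudaHostAlloc for cudaMemcpyAsync to overlap, d_a/d_b/d_c are device buffers, and n is assumed divisible by NUM_STREAMS):

#define NUM_STREAMS 4

cudaStream_t streams[NUM_STREAMS];
for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamCreate(&streams[s]);

int chunk = n / NUM_STREAMS;
for (int s = 0; s < NUM_STREAMS; s++) {
    int off = s * chunk;
    // Chunk s's H2D copies, kernel, and D2H copy all queue in stream s,
    // so chunk s+1's copies can overlap chunk s's kernel execution.
    cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(d_b + off, h_b + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    vecAdd<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
        d_a + off, d_b + off, d_c + off, chunk);
    cudaMemcpyAsync(h_c + off, d_c + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();

for (int s = 0; s < NUM_STREAMS; s++)
    cudaStreamDestroy(streams[s]);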

Q28
2 Points

Commands (aka Events) in the same CUDA stream can be processed out-of-order.
True
False

Q29
2 Points

cudaHostAlloc allocates memory in:


Managed Memory
Pinned Memory
Unified Memory
Global Memory
Host Memory

Q30
2 Points

cudaMalloc allocates memory in:


Unified Memory
Managed Memory
Host Memory
Pinned Memory
Global Memory
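To summarize the allocation APIs touched on in Q17, Q29, and Q30, a minimal sketch (size is illustrative):

float *d_ptr, *h_pinned, *managed;
size_t bytes = 1024 * sizeof(float);

cudaMalloc(&d_ptr, bytes);            // device global memory
cudaHostAlloc(&h_pinned, bytes,       // pinned (page-locked) host memory
              cudaHostAllocDefault);
cudaMallocManaged(&managed, bytes);   // unified/managed memory, visible to CPU and GPU

cudaFree(d_ptr);
cudaFreeHost(h_pinned);
cudaFree(managed);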

Final Exam (Graded)

Student: Sree Charan Reddy Gangireddy
Total Points: 77 / 100 pts

Question 1: 2 / 2 pts
Question 2: 4 / 4 pts
Question 3: 4 / 4 pts
Question 4: 2 / 2 pts
Question 5: 9 / 12 pts (5.1: 2/2, 5.2: 4/4, 5.3: 3/3, 5.4: 0/3)
Question 6: 2 / 2 pts
Question 7: 3 / 3 pts
Question 8: 0 / 2 pts
Question 9: 2 / 2 pts
Question 10: 2 / 2 pts
Question 11: 3 / 5 pts
Question 12: 2 / 2 pts
Question 13: 2 / 2 pts
Question 14: 0 / 3 pts
Question 15: 2 / 2 pts
Question 16: 3 / 3 pts
Question 17: 2 / 2 pts
Question 18: 3 / 3 pts
Question 19: 2 / 2 pts
Question 20: 2 / 2 pts
Question 21: 8 / 12 pts (21.1: 4/4, 21.2: 0/4, 21.3: 4/4)
Question 22: 2 / 4 pts
Question 23: 4 / 4 pts
Question 24: 3 / 3 pts
Question 25: 0 / 5 pts
Question 26: 2 / 2 pts
Question 27: 3 / 3 pts
Question 28: 2 / 2 pts
Question 29: 0 / 2 pts
Question 30: 2 / 2 pts
