
# Homework 1

These exercises will have you write some basic CUDA applications. You
will learn how to allocate GPU memory, move data between the host and
the GPU, and launch kernels.

## **1. Hello World**

Your first task is to create a simple hello world application in CUDA. The
code skeleton is already given to you in `hello.cu`. Edit that file, paying
attention to the FIXME locations, so that the output when run is like this:

```
Hello from block: 0, thread: 0
Hello from block: 0, thread: 1
Hello from block: 1, thread: 0
Hello from block: 1, thread: 1
```

(the ordering of the above lines may vary; ordering differences do not
indicate an incorrect result)

Note the use of `cudaDeviceSynchronize()` after the kernel launch. In
CUDA, kernel launches are *asynchronous* with respect to the host thread: the host
thread launches the kernel but does not wait for it to finish before proceeding
with the next line of host code. Therefore, to prevent the application from terminating
before the kernel gets to print its message, we must use this
synchronization function.
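As a minimal illustration of that behavior (a small variation on the exercise, not the `hello.cu` skeleton itself), the host-side `printf` below runs as soon as the launch returns, while the kernel's output only appears once the synchronization point is reached:

```
#include <stdio.h>

__global__ void hello(){
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main(){
  hello<<<2,2>>>();                 // asynchronous: control returns to the host immediately
  printf("Kernel launched (its output has not necessarily appeared yet)\n");
  cudaDeviceSynchronize();          // block until the kernel finishes and its output is flushed
  return 0;
}
```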

After editing the code, compile it using the following:

```
module load cuda
nvcc -o hello hello.cu
```
The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
`nvcc` is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

If you have trouble, you can look at `hello_solution.cu` for a complete example.

To run your code at OLCF on Summit, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./hello
```

Alternatively, you may want to create an alias for your `bsub` command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./hello
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --gres=gpu:1 -c 10 ./hello
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance
until January 18, 2020. If you cannot submit using this allocation, but
already have access to another allocation that grants access to the Cori
GPU nodes, you may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active:

```
salloc -C gpu -N 1 -t 60 -A m3502 --gres=gpu:1 -c 10
srun -n 1 ./hello
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

## **2. Vector Add**

If you're up for a challenge, see if you can write a complete vector add
program from scratch. Or if you prefer, there is a skeleton code given to
you in `vector_add.cu`. Edit the code to build a complete vector_add
program. Compile it and run it similar to the method given in exercise 1.
You can refer to `vector_add_solution.cu` for a complete example.

Note that this skeleton code includes something we didn't cover in lesson 1:
CUDA error checking. Every CUDA runtime API call returns an error code.
It's good practice (especially if you're having trouble) to rigorously check
these error codes. A macro is given that will make this job easier. Note the
special error checking method after a kernel call.
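As a hedged sketch of why that post-launch check matters (this is not one of the homework files; the macro is the same one given in the skeleton), a bad launch configuration is only reported when you ask for it with `cudaGetLastError()`:

```
#include <stdio.h>
#include <stdlib.h>

// same error checking macro as in the skeleton
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

__global__ void do_nothing(){ }

int main(){
  do_nothing<<<1, 2048>>>();                 // invalid: 2048 threads per block exceeds the 1024 limit
  cudaCheckErrors("kernel launch failure");  // cudaGetLastError() reports the failed launch here
  return 0;
}
```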

Typical output when complete would look like this:


```
A[0] = 0.840188
B[0] = 0.394383
C[0] = 1.234571
```
## **3. Matrix Multiply (naive)**

A skeleton naive matrix multiply is given to you in `matrix_mul.cu`. See if
you can complete it to get a correct result. If you need help, you can refer
to `matrix_mul_solution.cu`.

This example introduces 2D threadblock/grid indexing, something we did
not cover in lesson 1. If you study the code you will probably be able to see
how it is a structural extension from the 1D case.
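As a minimal sketch of that extension (not one of the homework files), each thread combines its 2D block and thread indices into an (x, y) element position:

```
#include <stdio.h>

__global__ void index_2d(){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;  // x (column) index
  int idy = threadIdx.y + blockDim.y * blockIdx.y;  // y (row) index
  printf("block (%d,%d), thread (%d,%d) -> element (%d,%d)\n",
         blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, idx, idy);
}

int main(){
  dim3 block(2, 2);  // 2x2 threads per block
  dim3 grid(2, 2);   // 2x2 blocks
  index_2d<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}
```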

This code includes built-in error checking, so a correct result is indicated by
the program.

hello.cu:

```
#include <stdio.h>

__global__ void hello(){
  printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main(){
  hello<<<2,2>>>();
  cudaDeviceSynchronize();
}
```

matrix_mul.cu (naive):

```
#include <stdio.h>

// these are just for timing measurements
#include <time.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const int DSIZE = 8192;
const int block_size = 32;  // CUDA maximum is 1024 *total* threads in block
const float A_val = 3.0f;
const float B_val = 2.0f;

// matrix multiply (naive) kernel: C = A * B
__global__ void mmul(const float *A, const float *B, float *C, int ds) {

  int idx = threadIdx.x+blockDim.x*blockIdx.x; // create thread x index
  int idy = threadIdx.y+blockDim.y*blockIdx.y; // create thread y index

  if ((idx < ds) && (idy < ds)){
    float temp = 0;
    for (int i = 0; i < ds; i++)
      temp += A[idy*ds+i] * B[i*ds+idx];  // dot product of row and column
    C[idy*ds+idx] = temp;
  }
}

int main(){

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  // these are just for timing
  clock_t t0, t1, t2;
  double t1sum=0.0;
  double t2sum=0.0;

  // start timing
  t0 = clock();

  h_A = new float[DSIZE*DSIZE];
  h_B = new float[DSIZE*DSIZE];
  h_C = new float[DSIZE*DSIZE];
  for (int i = 0; i < DSIZE*DSIZE; i++){
    h_A[i] = A_val;
    h_B[i] = B_val;
    h_C[i] = 0;}

  // Initialization timing
  t1 = clock();
  t1sum = ((double)(t1-t0))/CLOCKS_PER_SEC;
  printf("Init took %f seconds. Begin compute\n", t1sum);

  // Allocate device memory and copy input data over to GPU
  cudaMalloc(&d_A, DSIZE*DSIZE*sizeof(float));
  cudaMalloc(&d_B, DSIZE*DSIZE*sizeof(float));
  cudaMalloc(&d_C, DSIZE*DSIZE*sizeof(float));
  cudaCheckErrors("cudaMalloc failure");
  cudaMemcpy(d_A, h_A, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");

  // Cuda processing sequence step 1 is complete

  // Launch kernel
  dim3 block(block_size, block_size);  // dim3 variable holds 3 dimensions
  dim3 grid((DSIZE+block.x-1)/block.x, (DSIZE+block.y-1)/block.y);
  mmul<<<grid, block>>>(d_A, d_B, d_C, DSIZE);
  cudaCheckErrors("kernel launch failure");

  // Cuda processing sequence step 2 is complete

  // Copy results back to host
  cudaMemcpy(h_C, d_C, DSIZE*DSIZE*sizeof(float), cudaMemcpyDeviceToHost);

  // GPU timing
  t2 = clock();
  t2sum = ((double)(t2-t1))/CLOCKS_PER_SEC;
  printf("Done. Compute took %f seconds\n", t2sum);

  // Cuda processing sequence step 3 is complete

  // Verify results
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  for (int i = 0; i < DSIZE*DSIZE; i++)
    if (h_C[i] != A_val*B_val*DSIZE) {
      printf("mismatch at index %d, was: %f, should be: %f\n", i, h_C[i], A_val*B_val*DSIZE);
      return -1;}
  printf("Success!\n");
  return 0;
}
```

vector_add.cu:

```
#include <stdio.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const int DSIZE = 4096;
const int block_size = 256;  // CUDA maximum is 1024

// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < ds)
    C[idx] = A[idx] + B[idx];
}

int main(){

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  h_A = new float[DSIZE];
  h_B = new float[DSIZE];
  h_C = new float[DSIZE];
  for (int i = 0; i < DSIZE; i++){
    h_A[i] = rand()/(float)RAND_MAX;
    h_B[i] = rand()/(float)RAND_MAX;
    h_C[i] = 0;}
  cudaMalloc(&d_A, DSIZE*sizeof(float));
  cudaMalloc(&d_B, DSIZE*sizeof(float));
  cudaMalloc(&d_C, DSIZE*sizeof(float));
  cudaCheckErrors("cudaMalloc failure");
  cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");
  //cuda processing sequence step 1 is complete
  vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete
  cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  printf("A[0] = %f\n", h_A[0]);
  printf("B[0] = %f\n", h_B[0]);
  printf("C[0] = %f\n", h_C[0]);
  return 0;
}
```

# Homework 2

These exercises will help reinforce the concept of Shared Memory on the
GPU.

## **1. 1D Stencil Using Shared Memory**

Your first task is to create a 1D stencil application that uses shared
memory. The code skeleton is provided in *stencil_1d.cu*. Edit that file,
paying attention to the FIXME locations. The code will verify output and
report any errors.

After editing the code, compile it using the following:

```
module load cuda
nvcc -o stencil_1d stencil_1d.cu
```
The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./stencil_1d
```

Alternatively, you may want to create an alias for your *bsub* command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./stencil_1d
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10 ./stencil_1d
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10
srun -n 1 ./stencil_1d
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

If you have trouble, you can look at *stencil_1d_solution* for a complete example.

## **2. 2D Matrix Multiply Using Shared Memory**

Next, let's apply shared memory to the 2D matrix multiply we wrote in
Homework 1. FIXME locations are provided in the code skeleton in
*matrix_mul_shared.cu*. See if you can successfully load the required data
into shared memory and then appropriately update the dot product
calculation. Compile and run your code using the following:

```
module load cuda
nvcc -o matrix_mul matrix_mul_shared.cu
lsfrun ./matrix_mul
```

Note that timing information is included. Go back and run your solution from
Homework 1 and observe the runtime. What runtime impact do you notice
after applying shared memory to this 2D matrix multiply? How does it differ
from the runtime you observed in your previous implementation?

If you have trouble, you can look at *matrix_mul_shared_solution* for a
complete example.

matrix_mul_shared.cu:

```
#include <stdio.h>

// these are just for timing measurements
#include <time.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const int DSIZE = 8192;
const int block_size = 32;  // CUDA maximum is 1024 *total* threads in block
const float A_val = 3.0f;
const float B_val = 2.0f;

// matrix multiply (shared memory, tiled) kernel: C = A * B
__global__ void mmul(const float *A, const float *B, float *C, int ds) {

  // declare cache in shared memory
  __shared__ float As[block_size][block_size];
  __shared__ float Bs[block_size][block_size];

  int idx = threadIdx.x+blockDim.x*blockIdx.x; // create thread x index
  int idy = threadIdx.y+blockDim.y*blockIdx.y; // create thread y index

  if ((idx < ds) && (idy < ds)){
    float temp = 0;
    for (int i = 0; i < ds/block_size; i++) {

      // Load data into shared memory
      As[threadIdx.y][threadIdx.x] = A[idy * ds + (i * block_size + threadIdx.x)];
      Bs[threadIdx.y][threadIdx.x] = B[(i * block_size + threadIdx.y) * ds + idx];

      // Synchronize
      __syncthreads();

      // Keep track of the running sum
      for (int k = 0; k < block_size; k++)
        temp += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // dot product of row and column
      __syncthreads();
    }

    // Write to global memory
    C[idy*ds+idx] = temp;
  }
}

int main(){

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  // these are just for timing
  clock_t t0, t1, t2;
  double t1sum=0.0;
  double t2sum=0.0;

  // start timing
  t0 = clock();

  h_A = new float[DSIZE*DSIZE];
  h_B = new float[DSIZE*DSIZE];
  h_C = new float[DSIZE*DSIZE];
  for (int i = 0; i < DSIZE*DSIZE; i++){
    h_A[i] = A_val;
    h_B[i] = B_val;
    h_C[i] = 0;}

  // Initialization timing
  t1 = clock();
  t1sum = ((double)(t1-t0))/CLOCKS_PER_SEC;
  printf("Init took %f seconds. Begin compute\n", t1sum);

  // Allocate device memory and copy input data over to GPU
  cudaMalloc(&d_A, DSIZE*DSIZE*sizeof(float));
  cudaMalloc(&d_B, DSIZE*DSIZE*sizeof(float));
  cudaMalloc(&d_C, DSIZE*DSIZE*sizeof(float));
  cudaCheckErrors("cudaMalloc failure");
  cudaMemcpy(d_A, h_A, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");

  // Cuda processing sequence step 1 is complete

  // Launch kernel
  dim3 block(block_size, block_size);  // dim3 variable holds 3 dimensions
  dim3 grid((DSIZE+block.x-1)/block.x, (DSIZE+block.y-1)/block.y);
  mmul<<<grid, block>>>(d_A, d_B, d_C, DSIZE);
  cudaCheckErrors("kernel launch failure");

  // Cuda processing sequence step 2 is complete

  // Copy results back to host
  cudaMemcpy(h_C, d_C, DSIZE*DSIZE*sizeof(float), cudaMemcpyDeviceToHost);

  // GPU timing
  t2 = clock();
  t2sum = ((double)(t2-t1))/CLOCKS_PER_SEC;
  printf("Done. Compute took %f seconds\n", t2sum);

  // Cuda processing sequence step 3 is complete

  // Verify results
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  for (int i = 0; i < DSIZE*DSIZE; i++)
    if (h_C[i] != A_val*B_val*DSIZE) {
      printf("mismatch at index %d, was: %f, should be: %f\n", i, h_C[i], A_val*B_val*DSIZE);
      return -1;}
  printf("Success!\n");
  return 0;
}
```

stencil_1d.cu:

```
#include <stdio.h>
#include <algorithm>

using namespace std;

#define N 4096
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {

  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;

  // Read input elements into shared memory
  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }

  // Synchronize (ensure all the data is available)
  __syncthreads();

  // Apply the stencil
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];

  // Store the result
  out[gindex] = result;
}

void fill_ints(int *x, int n) {
  fill_n(x, n, 1);
}

int main(void) {
  int *in, *out;      // host copies of in, out
  int *d_in, *d_out;  // device copies of in, out
  int size = (N + 2*RADIUS) * sizeof(int);

  // Alloc space for host copies and setup values
  in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
  out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

  // Alloc space for device copies
  cudaMalloc((void **)&d_in, size);
  cudaMalloc((void **)&d_out, size);

  // Copy to device
  cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

  // Launch stencil_1d() kernel on GPU
  stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

  // Copy result back to host
  cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

  // Error Checking
  for (int i = 0; i < N + 2*RADIUS; i++) {
    if (i<RADIUS || i>=N+RADIUS){
      if (out[i] != 1)
        printf("Mismatch at index %d, was: %d, should be: %d\n", i, out[i], 1);
    } else {
      if (out[i] != 1 + 2*RADIUS)
        printf("Mismatch at index %d, was: %d, should be: %d\n", i, out[i], 1 + 2*RADIUS);
    }
  }

  // Cleanup
  free(in); free(out);
  cudaFree(d_in); cudaFree(d_out);
  printf("Success!\n");
  return 0;
}
```

# Homework 3

## **1. Vector Add**

We'll use a slight variation on the vector add code presented in a previous
homework (*vector_add.cu*). Edit the code to build a complete vector_add
program. You can refer to *vector_add_solution.cu* for a complete
example. For this example, we have made a change to the kernel to use
something called a grid-stride loop. This topic will be dealt with in more
detail in a later training session, but for now we can describe it as a flexible
kernel design method that allows a simple kernel to handle an arbitrary size
data set with an arbitrary size "grid", i.e. the configuration of blocks and
threads associated with the kernel launch. If you'd like to read more about
grid-stride loops right now, you can visit
https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
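The general shape of such a kernel is sketched below (a generic illustration, not the vector_add.cu skeleton itself); because the loop stride equals the total number of threads in the grid, any grid size can cover any data size:

```
#include <stdio.h>

// Generic grid-stride loop: each thread starts at its global index and then
// strides by the total number of threads in the grid (gridDim.x * blockDim.x).
__global__ void scale(float *data, size_t n){
  for (size_t idx = threadIdx.x + (size_t)blockDim.x * blockIdx.x; idx < n;
       idx += (size_t)gridDim.x * blockDim.x)
    data[idx] *= 2.0f;
}

int main(){
  const size_t N = 1048576;
  float *d_data;
  cudaMalloc(&d_data, N * sizeof(float));
  cudaMemset(d_data, 0, N * sizeof(float));
  scale<<<1, 1>>>(d_data, N);       // even a 1-thread grid processes all N elements (slowly)
  scale<<<160, 1024>>>(d_data, N);  // a large grid processes the same data, just faster
  cudaDeviceSynchronize();
  printf("done\n");
  cudaFree(d_data);
  return 0;
}
```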

As we will see, this flexibility is important for our investigations in section 2
of this homework session. However, as before, all you need to focus on
are the FIXME items, and these sections will be identical to the work you
did in a previous homework assignment. If you get stuck, you can refer to
the solution *vector_add_solution.cu*.

Note that this skeleton code includes something we didn't cover in lesson 1:
CUDA error checking. Every CUDA runtime API call returns an error code.
It's good practice (especially if you're having trouble) to rigorously check
these error codes. A macro is given that will make this job easier. Note the
special error checking method after a kernel call.

After editing the code, compile it using the following:

```
module load cuda
nvcc -o vector_add vector_add.cu
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./vector_add
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./vector_add
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10 ./vector_add
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10
srun -n 1 ./vector_add
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

We've also changed the problem size from the previous example, so
correct output should look like this:

```
A[0] = 0.120663
B[0] = 0.615704
C[0] = 0.736367
```

(The actual numerical values aren't too important, as long as C[0] = A[0] + B[0].)

## **2. Profiling Experiments**

Our objective now will be to explore some of the concepts we learned in the
lesson. In particular, we want to see what effect grid sizing (the choice of
blocks, and threads per block) has on performance. We could do analysis
like this using host-code-based timing methods, but we'll introduce a new
concept: using a GPU profiler. In a future session, you'll learn more about
the GPU profilers (Nsight Compute and Nsight Systems), but for now we
will use Nsight Compute in a fairly simple fashion to get some basic data
about kernel behavior, to use for comparison.
(If you'd like to read more about the Nsight profilers, you can start here:
https://devblogs.nvidia.com/migrating-nvidia-nsight-tools-nvvp-nvprof/)

First, note that the code has these two lines in it:

```
int blocks = 1; // modify this line for experimentation
int threads = 1; // modify this line for experimentation
```

These lines control the grid sizing. The first variable, *blocks*, chooses the
total number of blocks to launch. The second variable, *threads*, chooses the
number of threads per block to launch; this variable must be
constrained to choices between 1 and 1024, inclusive. These are limits
imposed by the GPU hardware.
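If you want to confirm those limits on the machine you are using (a hedged sketch, not part of the homework files), the CUDA runtime can report them:

```
#include <stdio.h>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // properties of device 0
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);           // the 1024 limit mentioned above
  printf("Number of SMs:         %d\n", prop.multiProcessorCount);          // 80 on a Tesla V100
  printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);  // 2048 on a Tesla V100
  return 0;
}
```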

Let's consider 3 cases. In each case, we will modify the blocks and threads
variables, recompile the code, and then run the code under the Nsight
Compute profiler.

Nsight Compute is installed as part of newer CUDA toolkits (10.1 and
newer), but the path to the command line tool may or may not be set up as
part of your CUDA install. Therefore it may be necessary to specify the
complete command line to access the tool. We will demonstrate that here
with our invocations.

For the following profiler experiments, we will assume you have loaded the
profile module and acquired a node for interactive usage:

```
module load nsight-compute
bsub -W 30 -nnodes 1 -P <allocation_ID> -Is /bin/bash
```

### **2a. 1 block of 1 thread**

For this experiment, leave the code as you have created it to complete
exercise 1 above. When running the code you may have noticed it takes a
few seconds to run, however the duration is not particularly long. This
raises the question "how much of that time is the kernel running?" The
profiler can help us answer that question, and we can use this duration (or
various other characteristics) as indicators of "performance" for
comparison. The kernel is designed to do the same set of arithmetic
calculations regardless of the grid sizing choices, so we can say that
shorter kernel duration corresponds to higher performance.

If you'd like to get a basic idea of "typical" profiler output, you could use the
following command:

```
jsrun -n1 -a1 -c1 -g1 nv-nsight-cu-cli ./vector_add
```

However for this 1 block/1 thread test case, the profiler will spend several
minutes assembling the requested set of information. Since our focus is on
kernel duration, we can use a command that allows the profiler to run more
quickly:

```
jsrun -n1 -a1 -c1 -g1 nv-nsight-cu-cli --section SpeedOfLight --section MemoryWorkloadAnalysis ./vector_add
```

This will allow the profiler to complete its work in under a minute.

We won't parse all the output, but we're interested in these lines:
```
Duration                                      second                           2.86
```

and:

```
Memory Throughput                             Mbyte/second                   204.25
```

The above indicate that our kernel took about 3 seconds to run and
achieved around 200 MB/s of "throughput", i.e. combined read and write
activity to the GPU memory. A Tesla V100 has around 700-900 GB/s of
available memory throughput, so this code isn't using the available memory
bandwidth very well, amongst other issues. Can we improve the situation
with some changes to our grid sizing?

### **2b. 1 block of 1024 threads**

In our training session, we learned that we want "lots of threads". More
specifically, we learned that we'd like to deposit as many as 2048 threads
on a single SM, and ideally we'd like to do this across all the SMs in the
GPU. This allows the GPU to do "latency hiding", which we said was very
important for GPU performance, and in the case of this code, the extra
threads will help with memory utilization as well, as we shall see.
In fact, for this code, "lots of threads/latency hiding" and "efficient use of
memory" are two sides of the same coin. This will become more evident in
the next training session.

So let's take a baby step with our code. Let's change from 1 block of 1
thread to 1 block of 1024 threads. As we've learned, this structure isn't very
good, because it can use at most a single SM on our GPU, but can it
improve performance at all?

Edit the code to make the changes to the threads (1024) variable only.
Leave the blocks variable at 1. Recompile the code and then rerun the
same profiler command. What are the kernel duration and (achieved)
memory throughput now?

(You should now observe a kernel duration that drops from the second
range to the millisecond range, and the memory throughput should now be
in the GB/s range instead of MB/s.)

### **2c. 160 blocks of 1024 threads**

Let's fill the GPU now. We learned that a Tesla V100 has 80 SMs, and
each SM can handle at most 2048 threads. If we create a grid of 160
blocks, each of 1024 threads, this should allow for maximum "occupancy"
of our kernel/grid on the GPU. Make the necessary changes to the blocks
(= 160) variable (the threads variable should already be at 1024 from step
2b), recompile the code, and rerun the profiler command as given in 2a.
What is the performance (kernel duration) and achieved memory
throughput now?

(You should now observe a kernel duration that has dropped to the
microsecond range - ~500us - and a memory throughput that should be
"close" to the peak theoretical of 900GB/s for a Tesla V100).

For the Tesla V100 GPU, this calculation of 80 SMs * 2048 threads/SM =
164K threads is our definition of "lots of threads".

vector_add.cu (with grid-stride loop):

```
#include <stdio.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const int DSIZE = 32*1048576;

// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){

  for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < ds; idx+=gridDim.x*blockDim.x)  // a grid-stride loop
    C[idx] = A[idx] + B[idx];  // do the vector (element) add here
}

int main(){

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  h_A = new float[DSIZE];  // allocate space for vectors in host memory
  h_B = new float[DSIZE];
  h_C = new float[DSIZE];
  for (int i = 0; i < DSIZE; i++){  // initialize vectors in host memory
    h_A[i] = rand()/(float)RAND_MAX;
    h_B[i] = rand()/(float)RAND_MAX;
    h_C[i] = 0;}
  cudaMalloc(&d_A, DSIZE*sizeof(float));  // allocate device space for vector A
  cudaMalloc(&d_B, DSIZE*sizeof(float));  // allocate device space for vector B
  cudaMalloc(&d_C, DSIZE*sizeof(float));  // allocate device space for vector C
  cudaCheckErrors("cudaMalloc failure");  // error checking
  // copy vector A to device:
  cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  // copy vector B to device:
  cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");
  //cuda processing sequence step 1 is complete
  int blocks = 1;   // modify this line for experimentation
  int threads = 1;  // modify this line for experimentation
  vadd<<<blocks, threads>>>(d_A, d_B, d_C, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete
  // copy vector C from device to host:
  cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  printf("A[0] = %f\n", h_A[0]);
  printf("B[0] = %f\n", h_B[0]);
  printf("C[0] = %f\n", h_C[0]);
  return 0;
}
```

# Homework 4

## **1. Matrix Row/Column Sums**

Your first task is to create a simple matrix row and column sum application
in CUDA. The code skeleton is already given to you in *matrix_sums.cu*.
Edit that file, paying attention to the FIXME locations, so that the output
when run is like this:

```
row sums correct!
column sums correct!
```
After editing the code, compile it using the following:

```
module load cuda
nvcc -o matrix_sums matrix_sums.cu
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./matrix_sums
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./matrix_sums
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10 ./matrix_sums
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10
srun -n 1 ./matrix_sums
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

If you have trouble, you can look at *matrix_sums_solution.cu* for a complete example.

## **2. Profiling**

We'll introduce something new: the profiler (in this case, Nsight Compute).
We'll use the profiler first to measure the kernel execution times, and then to
gather some "metric" information that will possibly shed light on our
observations.

It's necessary to complete task 1 first. Next, load the Nsight Compute
module:
```
module load nsight-compute
```

Then, launch Nsight as follows (you may want to make your terminal
session wide enough to make the output easy to read):

```
lsfrun nv-nsight-cu-cli ./matrix_sums
```

What does the output tell you?
Can you locate the lines that identify the kernel durations?
Are the kernel durations the same or different?
Would you expect them to be the same or different?

Next, launch *Nsight* as follows:

```
lsfrun nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum ./matrix_sums
```

Our goal is to measure the global memory load efficiency of our kernels. In
this case we have asked for two metrics:
"*l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum*" (the number of
global memory load requests) and
"*l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum*" (the number of
sectors requested for global loads). This first metric above represents the
denominator (requests) of the desired measurement (transactions per
request) and the second metric represents the numerator (transactions).
Dividing these numbers will give us the number of transactions per request.
What similarities or differences do you notice between the *row_sums* and
*column_sums* kernels?
Do the kernels (*row_sums*, *column_sums*) have the same or different
efficiencies?
Why?
How does this correspond to the observed kernel execution times for the
first profiling run?

Can we improve this? (Stay tuned for the next CUDA training session.)

Here is a useful blog to help you get familiar with Nsight Compute:
https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

matrix_sums.cu:

```
#include <stdio.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const size_t DSIZE = 16384;  // matrix side dimension
const int block_size = 256;  // CUDA maximum is 1024

// matrix row-sum kernel
__global__ void row_sums(const float *A, float *sums, size_t ds){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;  // create typical 1D thread index from built-in variables
  if (idx < ds){
    float sum = 0.0f;
    for (size_t i = 0; i < ds; i++)
      sum += A[idx*ds+i];  // iterate across a row, keeping a running sum, and write the result to sums
    sums[idx] = sum;
  }}

// matrix column-sum kernel
__global__ void column_sums(const float *A, float *sums, size_t ds){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;  // create typical 1D thread index from built-in variables
  if (idx < ds){
    float sum = 0.0f;
    for (size_t i = 0; i < ds; i++)
      sum += A[idx+ds*i];  // iterate down a column, keeping a running sum, and write the result to sums
    sums[idx] = sum;
  }}

bool validate(float *data, size_t sz){
  for (size_t i = 0; i < sz; i++)
    if (data[i] != (float)sz) {
      printf("results mismatch at %lu, was: %f, should be: %f\n", i, data[i], (float)sz);
      return false;}
  return true;
}

int main(){
  float *h_A, *h_sums, *d_A, *d_sums;
  h_A = new float[DSIZE*DSIZE];  // allocate space for data in host memory
  h_sums = new float[DSIZE]();
  for (int i = 0; i < DSIZE*DSIZE; i++)  // initialize matrix in host memory
    h_A[i] = 1.0f;

  cudaMalloc(&d_A, DSIZE*DSIZE*sizeof(float));  // allocate device space for A
  cudaMalloc(&d_sums, DSIZE*sizeof(float));     // allocate device space for vector d_sums
  cudaCheckErrors("cudaMalloc failure");  // error checking

  // copy matrix A to device:
  cudaMemcpy(d_A, h_A, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");
  //cuda processing sequence step 1 is complete

  row_sums<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_sums, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete

  // copy vector sums from device to host:
  cudaMemcpy(h_sums, d_sums, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete

  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  if (!validate(h_sums, DSIZE)) return -1;
  printf("row sums correct!\n");

  cudaMemset(d_sums, 0, DSIZE*sizeof(float));

  column_sums<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_sums, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete
  // copy vector sums from device to host:
  cudaMemcpy(h_sums, d_sums, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete

  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  if (!validate(h_sums, DSIZE)) return -1;
  printf("column sums correct!\n");
  return 0;
}
```

# Homework 5

## **1. Comparing Reductions**

For your first task, the code is already written for you. We will compare 3 of
the reductions given during the presentation: the naive atomic-only
reduction, the classical parallel reduction with atomic finish, and the warp
shuffle reduction (with atomic finish).
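As a reminder of what the warp shuffle variant looks like, here is a hedged sketch of the technique (not the reductions.cu file itself; this version finishes with one atomicAdd per warp rather than combining warp sums in shared memory first):

```
#include <stdio.h>

// Warp-shuffle sum reduction with an atomic finish: each warp reduces its own
// partial sum with __shfl_down_sync, then lane 0 adds it to the global result.
__global__ void reduce_ws_sketch(const float *gdata, float *out, size_t n){
  float val = 0.0f;
  for (size_t idx = threadIdx.x + (size_t)blockDim.x * blockIdx.x; idx < n;
       idx += (size_t)gridDim.x * blockDim.x)      // grid-stride load
    val += gdata[idx];
  for (int offset = warpSize/2; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xFFFFFFFFU, val, offset);  // after this loop, lane 0 holds the warp's sum
  if ((threadIdx.x & (warpSize - 1)) == 0)         // lane 0 of each warp
    atomicAdd(out, val);
}

int main(){
  const size_t N = 1048576;
  float *h_data = new float[N];
  for (size_t i = 0; i < N; i++) h_data[i] = 1.0f;
  float *d_data, *d_out, h_out = 0.0f;
  cudaMalloc(&d_data, N * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(d_out, 0, sizeof(float));
  reduce_ws_sketch<<<160, 256>>>(d_data, d_out, N);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("sum = %f (expected %f)\n", h_out, (float)N);
  return 0;
}
```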

Compile it using the following:

```
module load cuda
nvcc -o reductions reductions.cu
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++. Let's also load the Nsight Compute module:

```
module load nsight-compute
```
To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 nv-nsight-cu-cli ./reductions
```

Alternatively, you may want to create an alias for your *bsub* command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun nv-nsight-cu-cli ./reductions
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10 ./reductions
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 --reservation cuda_training --gres=gpu:1 -c 10
srun -n 1 ./reductions
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

This will run the code with the profiling in its most basic mode, which is
sufficient. We want to compare kernel execution times. What do you notice
about kernel execution times? Probably, you won't see much difference
between the parallel reduction with atomics and the warp shuffle with
atomics kernel. Can you theorize why this may be? Our objective with
these will be to approach theoretical limits. The theoretical limit for a typical
reduction would be determined by the memory bandwidth of the GPU. To
calculate the attained memory bandwidth of this kernel, divide the total data
size in bytes (use N from the code in your calculation) by the execution
time (which you can get from the profiler). How does this number compare
to the memory bandwidth of the GPU you are running on? (You could run
bandwidthTest sample code to get a proxy/estimate).
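As a hedged worked example of that calculation (the kernel duration below is hypothetical; use the value the profiler reports for your run):

```
#include <stdio.h>

int main(){
  const size_t N = 8ULL*1024ULL*1024ULL;           // N from the code: ~8M float elements
  const double bytes = (double)N * sizeof(float);  // total data read by the reduction
  const double kernel_seconds = 100e-6;            // hypothetical 100 microsecond kernel duration
  printf("Attained bandwidth: %.1f GB/s\n", bytes / kernel_seconds / 1e9);  // ~335.5 GB/s for these numbers
  return 0;
}
```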

Now edit the code to change *N* from ~8M to 163840 (=640*256).

Recompile and re-run the code with profiling. Is there a bigger percentage
difference between the execution times of the *reduce_a* and *reduce_ws*
kernels? Why might this be?

Bonus: edit the code to change *N* from ~8M to ~32M. Recompile and run.
What happened? Why?

## **2. Create a different reduction (besides sum)**

For this exercise, you are given a fully-functional sum-reduction code,
similar to the code used for exercise 1 above, except that we will use the
2-stage reduction method without atomic finish. If you wish you can compile
and run it as-is to see how it works. Your task is to modify it (*only the
kernel*) so that it creates a proper max-finding reduction. That means that
the kernel should report the maximum value in the data set, rather than the
sum of the data set. You are expected to use a similar
parallel-sweep-reduction technique. If you need help, refer to the solution.

```
nvcc -o max_reduction max_reduction.cu
lsfrun ./max_reduction
```

## **3. Revisit row_sums from hw4**

For this exercise, start with the *matrix_sums.cu* code from hw4. As you
may recall, the *row_sums* kernel was reading the same data set as the
*column_sums* kernel, but running noticeably slower. We now have some
ideas how to fix it. See if you can implement a reduction-per-row, to allow
the row-sum kernel to approach the performance of the column sum kernel.
There are probably several ways to tackle this problem. To see one
approach, refer to the solution.

You can start just by compiling the code as-is and running the profiler to
remind yourself of the performance (discrepancy).

Compile the code and profile it using Nsight Compute:

```
nvcc -o matrix_sums matrix_sums.cu
lsfrun nv-nsight-cu-cli ./matrix_sums
```

Remember from the previous session our top 2 CUDA optimization
priorities: lots of threads and efficient use of the memory subsystem. The
original row_sums kernel definitely misses the mark for the memory
objective. What we've learned about reductions should guide you. There
are probably several ways to tackle this:

- Write a straightforward parallel reduction, run it on a row, and use a
for-loop to loop the kernel over all rows
- Assign a warp to each row, to perform everything in one kernel call
- Assign a threadblock to each row, to perform everything in one kernel call
- ??

Since the (given) solution may be somewhat unusual, I'll give some hints
here if needed:

- The chosen strategy will be to assign one block per row
- We must modify the kernel launch to launch exactly as many blocks as
we have rows
- The kernel can be adapted from the reduction kernel (the atomic is not
needed here) in the reduce kernel code from exercise 1 above.
- Since we are assigning one block per row, we will cause each block to
perform a block-striding loop, to traverse the row. This is conceptually
similar to a grid-striding loop, except each block is striding individually, one
per row. Refresh your memory of the grid-stride loop, and see if you can
work this out.
- With the block-stride loop, you'll need to think carefully about indexing

After you have completed the work and are getting a successful result,
profile the code again to see if the performance of the row_sums kernel has
improved:

```
nvcc -o matrix_sums matrix_sums.cu
lsfrun nv-nsight-cu-cli ./matrix_sums
```

Your actual performance here (compared to the fairly efficient
column_sums kernel) will probably depend quite a bit on the
algorithm/method you choose. See if you can theorize how the various
choices may affect efficiency or optimality. If you end up with a solution
where the row_sums kernel actually runs faster than the column_sums
kernel, see if you can theorize why this may be. Remember the two CUDA
optimization priorities, and use these to guide your thinking.

matrix_sums.cu (row_sums solution, one block per row):

```
#include <stdio.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const size_t DSIZE = 16384;  // matrix side dimension
const int block_size = 256;  // CUDA maximum is 1024

// matrix row-sum kernel
// we will assign one block per row
__global__ void row_sums(const float *A, float *sums, size_t ds){

  int idx = blockIdx.x;  // our block index becomes our row indicator
  if (idx < ds){
    __shared__ float sdata[block_size];
    int tid = threadIdx.x;
    sdata[tid] = 0.0f;
    size_t tidx = tid;
    while (tidx < ds) {  // block stride loop to load data
      sdata[tid] += A[idx*ds+tidx];
      tidx += blockDim.x;
    }

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
      __syncthreads();
      if (tid < s)  // parallel sweep reduction
        sdata[tid] += sdata[tid + s];
    }
    if (tid == 0) sums[idx] = sdata[0];
  }
}

// matrix column-sum kernel
__global__ void column_sums(const float *A, float *sums, size_t ds){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;  // create typical 1D thread index from built-in variables
  if (idx < ds){
    float sum = 0.0f;
    for (size_t i = 0; i < ds; i++)
      sum += A[idx+ds*i];  // iterate down a column, keeping a running sum, and write the result to sums
    sums[idx] = sum;
  }}

bool validate(float *data, size_t sz){
  for (size_t i = 0; i < sz; i++)
    if (data[i] != (float)sz) {
      printf("results mismatch at %lu, was: %f, should be: %f\n", i, data[i], (float)sz);
      return false;}
  return true;
}

int main(){
  float *h_A, *h_sums, *d_A, *d_sums;
  h_A = new float[DSIZE*DSIZE];  // allocate space for data in host memory
  h_sums = new float[DSIZE]();
  for (int i = 0; i < DSIZE*DSIZE; i++)  // initialize matrix in host memory
    h_A[i] = 1.0f;
  cudaMalloc(&d_A, DSIZE*DSIZE*sizeof(float));  // allocate device space for A
  cudaMalloc(&d_sums, DSIZE*sizeof(float));     // allocate device space for vector d_sums
  cudaCheckErrors("cudaMalloc failure");  // error checking
  // copy matrix A to device:
  cudaMemcpy(d_A, h_A, DSIZE*DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");
  //cuda processing sequence step 1 is complete
  row_sums<<<DSIZE, block_size>>>(d_A, d_sums, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete
  // copy vector sums from device to host:
  cudaMemcpy(h_sums, d_sums, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  if (!validate(h_sums, DSIZE)) return -1;
  printf("row sums correct!\n");
  cudaMemset(d_sums, 0, DSIZE*sizeof(float));
  column_sums<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_sums, DSIZE);
  cudaCheckErrors("kernel launch failure");
  //cuda processing sequence step 2 is complete
  // copy vector sums from device to host:
  cudaMemcpy(h_sums, d_sums, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete
  cudaCheckErrors("kernel execution failure or cudaMemcpy H2D failure");
  if (!validate(h_sums, DSIZE)) return -1;
  printf("column sums correct!\n");
  return 0;
}
```

max_reduction.cu:

```
#include <stdio.h>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

const size_t N = 8ULL*1024ULL*1024ULL;  // data size
const int BLOCK_SIZE = 256;             // CUDA maximum is 1024

__global__ void reduce(float *gdata, float *out, size_t n){

  __shared__ float sdata[BLOCK_SIZE];
  int tid = threadIdx.x;
  sdata[tid] = 0.0f;
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;

  while (idx < n) {  // grid stride loop to load data
    sdata[tid] = max(gdata[idx], sdata[tid]);
    idx += gridDim.x*blockDim.x;
  }

  for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
    __syncthreads();
    if (tid < s)  // parallel sweep reduction
      sdata[tid] = max(sdata[tid + s], sdata[tid]);
  }
  if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main(){

  float *h_A, *h_sum, *d_A, *d_sums;
  const int blocks = 640;
  h_A = new float[N];  // allocate space for data in host memory
  h_sum = new float;
  float max_val = 5.0f;
  for (size_t i = 0; i < N; i++)  // initialize data in host memory
    h_A[i] = 1.0f;
  h_A[100] = max_val;
  cudaMalloc(&d_A, N*sizeof(float));          // allocate device space for A
  cudaMalloc(&d_sums, blocks*sizeof(float));  // allocate device space for partial sums
  cudaCheckErrors("cudaMalloc failure");      // error checking
  // copy data to device:
  cudaMemcpy(d_A, h_A, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy H2D failure");
  //cuda processing sequence step 1 is complete
  reduce<<<blocks, BLOCK_SIZE>>>(d_A, d_sums, N);  // reduce stage 1
  cudaCheckErrors("reduction kernel launch failure");
  reduce<<<1, BLOCK_SIZE>>>(d_sums, d_A, blocks);  // reduce stage 2
  cudaCheckErrors("reduction kernel launch failure");
  //cuda processing sequence step 2 is complete
  // copy vector sums from device to host:
  cudaMemcpy(h_sum, d_A, sizeof(float), cudaMemcpyDeviceToHost);
  //cuda processing sequence step 3 is complete
  cudaCheckErrors("reduction kernel execution failure or cudaMemcpy D2H failure");
  printf("reduction output: %f, expected sum reduction output: %f, expected max reduction output: %f\n",
         *h_sum, (float)((N-1)+max_val), max_val);
  return 0;
}
```

# Homework 6

These exercises will have you use Unified Memory to utilize GPUs on
non-trivial data structures.

## **1. Porting Linked Lists to GPUs**

For your first task, you are given a code that assembles a linked list on the
CPU, and then attempts to print an element from the list. Your task is to
modify the code using UM techniques, so that the linked list can be
correctly traversed either from CPU code or from GPU code. Hint: there is
only one line in the file that needs to be modified to do this exercise.
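As a hedged sketch of the underlying idea (a standalone toy, not the linked_list.cu skeleton; the node type and names here are illustrative), a managed allocation gives you a single pointer that is valid on both the host and the device:

```
#include <cstdio>

struct node { int key; node *next; };

__global__ void gpu_print(node *n){ printf("key = %d (from the GPU)\n", n->key); }

int main(){
  node *n;
  cudaMallocManaged(&n, sizeof(node));  // instead of n = (node *)malloc(sizeof(node));
  n->key = 3;                           // touch it from CPU code...
  n->next = nullptr;
  printf("key = %d (from the CPU)\n", n->key);
  gpu_print<<<1,1>>>(n);                // ...and pass the very same pointer to GPU code
  cudaDeviceSynchronize();
  cudaFree(n);
  return 0;
}
```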

Compile it using the following:

```
module load cuda
nvcc -o linked_list linked_list.cu
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:


```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./linked_list
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./linked_list
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --gres=gpu:1 -c 10 ./linked_list
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 --gres=gpu:1 -c 10
srun -n 1 ./linked_list
```
Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

Correct output should look like this:

```
key = 3
key = 3
```

If you need help, refer to *linked_list_solution.cu*

## **2. Array Increment**

In this exercise, you are given a code that increments a large array on the
GPU.

a. First, compile and profile the code as-is:

```
module load nsight-systems
nvcc -o array_inc array_inc.cu
lsfrun nsys profile --stats=true ./array_inc
```

Make a note of the kernel execution duration.

b. Now, modify the code to use managed memory. Replace the malloc
operations with cudaMallocManaged, and eliminate the cudaMemcpy
operations. Do you need to replace the *cudaMemcpy* operation from
device to host with a *cudaDeviceSynchronize()*? Why? Now, compile and
profile the code again. Compare the kernel execution duration to the
previous result. Note the profiler indication of CPU and GPU page faults.

c. Now, modify the code to insert prefetching of the array to the GPU
immediately before the kernel call, and back to the CPU immediately after
the kernel call. Compile and profile the code again. Compare the kernel
execution time to the previous results. Are there still any page faults? Why?

d. Bonus: Modify the code to run the *inc()* kernel 10000 times in a row
instead of just once. What can be said about the impact of memory
operations on our runtime? What would this suggest for a real-world
application?

If you need help, refer to the *array_inc_solution.cu*.

array_inc.cu:

```
#include <cstdio>
#include <cstdlib>
#include <cstring>  // for memset

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  cudaMallocManaged(&ptr, num_bytes);
}

__global__ void inc(int *array, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < n){
    array[idx]++;
    idx += blockDim.x*gridDim.x;  // grid-stride loop
  }
}

const size_t ds = 32ULL*1024ULL*1024ULL;

int main(){

  int *h_array;
  alloc_bytes(h_array, ds*sizeof(h_array[0]));
  cudaCheckErrors("cudaMallocManaged Error");
  memset(h_array, 0, ds*sizeof(h_array[0]));
  cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), 0);  // add in step 2c
  inc<<<256, 256>>>(h_array, ds);
  cudaCheckErrors("kernel launch error");
  cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), cudaCpuDeviceId);  // add in step 2c
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel execution error");
  for (int i = 0; i < ds; i++)
    if (h_array[i] != 1) {
      printf("mismatch at %d, was: %d, expected: %d\n", i, h_array[i], 1);
      return -1;}
  printf("success!\n");
  return 0;
}
```

linked_list.cu:

```
#include <cstdio>
#include <cstdlib>

// error checking macro
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
        msg, cudaGetErrorString(__err), \
        __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

struct list_elem {
  int key;
  list_elem *next;
};

template <typename T>
void alloc_bytes(T &ptr, size_t num_bytes){

  cudaMallocManaged(&ptr, num_bytes);
}

__host__ __device__
void print_element(list_elem *list, int ele_num){
  list_elem *elem = list;
  for (int i = 0; i < ele_num; i++)
    elem = elem->next;
  printf("key = %d\n", elem->key);
}

__global__ void gpu_print_element(list_elem *list, int ele_num){
  print_element(list, ele_num);
}

const int num_elem = 5;
const int ele = 3;

int main(){

  list_elem *list_base, *list;
  alloc_bytes(list_base, sizeof(list_elem));
  list = list_base;
  for (int i = 0; i < num_elem; i++){
    list->key = i;
    alloc_bytes(list->next, sizeof(list_elem));
    list = list->next;}
  print_element(list_base, ele);
  gpu_print_element<<<1,1>>>(list_base, ele);
  cudaDeviceSynchronize();
  cudaCheckErrors("cuda error!");
}
```

# Homework 7

## **1. Investigating Copy-Compute Overlap**

For your first task, you are given a code that performs a silly computation
element-wise on a vector. You can initially compile, run and profile the code
if you wish.

Compile it using the following:

```
module load cuda
nvcc -o overlap overlap.cu
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./overlap
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1'
lsfrun ./overlap
```

To run your code at NERSC on Cori, we can use Slurm:

```
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 -G 1 -c 10 ./overlap
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```
salloc -C gpu -N 1 -t 60 -A m3502 -G 1 -c 10
srun -n 1 ./overlap
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

In this case, the output will show the elapsed time of the non-overlapped
version of the code. This code copies the entire vector to the device, then
launches the processing kernel, then copies the entire vector back to the
host.

You can also run this code with Nsight Systems if you wish:

```
module load nsight-systems
lsfrun nsys profile -o <destination_dir>/overlap.qdrep ./overlap
```

Note that you will have to copy this file over to your local machine and
install Nsight Systems for visualization. You can download Nsight Systems
here:
https://developer.nvidia.com/nsight-systems

This visual output should show you the sequence of operations (*cudaMemcpy* Host to Device, kernel call, and *cudaMemcpy* Device to Host). Note that there is an initial "warm-up" run of the kernel; disregard this. You should be able to witness that the start and duration of each operation indicate that there is no overlap.

Your objective is to create a fully overlapped version of the code. Use your knowledge of streams to create a version of the code that issues the work in chunks: for each chunk, perform the copy to device, the kernel launch, and the copy to host in a single stream, then switch to a different stream for the next chunk. The work has been started for you in the section of code after the `#ifdef` statement. Look for the FIXME tokens there, and replace each FIXME with appropriate code to complete this task.
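If it helps to see the pattern in isolation, here is a minimal sketch of the chunked, depth-first issue order described above. It reuses the names from *overlap.cu*, assumes `ds` is evenly divisible by `chunks` (as it is in this exercise), and picks streams round-robin; it is a sketch of the idea, not necessarily line-for-line identical to *overlap_solution.cu*:

```cpp
const size_t cs = ds / chunks;                    // elements per chunk
for (int i = 0; i < chunks; i++) {
  cudaStream_t s = streams[i % num_streams];      // round-robin over the stream pool
  cudaMemcpyAsync(d_x + i * cs, h_x + i * cs, cs * sizeof(ft), cudaMemcpyHostToDevice, s);
  gaussian_pdf<<<(cs + 255) / 256, 256, 0, s>>>(d_x + i * cs, d_y + i * cs, 0.0, 1.0, cs);
  cudaMemcpyAsync(h_y + i * cs, d_y + i * cs, cs * sizeof(ft), cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();                          // wait for all chunks to finish
```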

When you have something ready to test, compile with this additional switch:

```
nvcc -o overlap overlap.cu -DUSE_STREAMS
```

If you run the code, a verification check is performed to make sure you have processed the entire vector correctly, in chunks. If you pass the verification test, the program will display the elapsed time of the streamed version. You should be able to get at least a 2X speedup (i.e. half the duration) relative to the non-streamed version. If you wish, you can also run this code with the Nsight Systems profiler using the command given above. This will generate a visual output, and you should be able to confirm that there is indeed overlap of operations by zooming in on the portion of execution related to kernel launches. You can see the non-overlapped version run, followed by the overlapped version. Not only should the overlapped version be faster, you should also see an interleaving of computation and data transfer operations.

If you need help, refer to *overlap_solution.cu*.

## **2. Simple Multi-GPU**

In this exercise, you are given a very simple code that performs 4 kernel
calls in sequence on a single GPU. You're welcome to compile and run the
code as-is. It will display an overall duration for the time taken to complete
the 4 kernel calls. Your task is to modify this code to run each kernel on a
separate GPU (each node on Summit actually has 6 GPUs). After
completion, confirm that the execution time is substantially reduced.

You can compile the code with:


```
nvcc -o multi multi.cu
```

You can run the code on Summit with:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g4'
lsfrun ./multi
```

On Cori, make sure that you ask for an allocation with 4 GPUs, e.g.

```
srun -C gpu -N 1 -n 1 -t 10 -A m3502 -G 4 -c 40 ./multi
```

**HINT**: This exercise might be simpler than you think. You won't need to
do anything with streams at all for this. You'll only need to make a simple
modification to each of the for-loops.

If you need help, refer to *multi_solution.cu*.
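The essential modification is a `cudaSetDevice()` call inside each for-loop, before that iteration's allocation, copy, or kernel launch (you can see the same pattern in the second code listing below). A minimal sketch of the launch loop:

```cpp
for (int i = 0; i < num_gpus; i++) {
  cudaSetDevice(i);   // route this iteration's kernel launch to GPU i
  gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x[i], d_y[i], 0.0, 1.0, ds);
}
```

Because kernel launches are asynchronous, the loop issues all of the kernels back-to-back, and they then run concurrently, one per GPU.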


#include <math.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

// modifiable
typedef float ft;
const int chunks = 64;
const size_t ds = 1024*1024*chunks;
const int count = 22;
const int num_gpus = 4;
// not modifiable
const float sqrt_2PIf = 2.5066282747946493232942230134974f;
const double sqrt_2PI = 2.5066282747946493232942230134974;
__device__ float gpdf(float val, float sigma) {
return expf(-0.5f * val * val) / (sigma * sqrt_2PIf);
}

__device__ double gpdf(double val, double sigma) {


return exp(-0.5 * val * val) / (sigma * sqrt_2PI);
}

// compute average gaussian pdf value over a window around each point
__global__ void gaussian_pdf(const ft * __restrict__ x, ft * __restrict__ y,
const ft mean, const ft sigma, const int n) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < n) {
ft in = x[idx] - (count / 2) * 0.01f;
ft out = 0;
for (int i = 0; i < count; i++) {
ft temp = (in - mean) / sigma;
out += gpdf(temp, sigma);
in += 0.01f;
}
y[idx] = out / count;
}
}

// error check macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

// host-based timing
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start) {


timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main() {
ft *h_x, *d_x[num_gpus], *d_y[num_gpus];
h_x = (ft *)malloc(ds * sizeof(ft));

for (int i = 0; i < num_gpus; i++) {


cudaMalloc(&d_x[i], ds * sizeof(ft));
cudaMalloc(&d_y[i], ds * sizeof(ft));
}
cudaCheckErrors("allocation error");

for (int i = 0; i < num_gpus; i++) {


for (size_t j = 0; j < ds; j++) {
h_x[j] = rand() / (ft)RAND_MAX;
}
cudaMemcpy(d_x[i], h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);
}
cudaCheckErrors("copy error");

unsigned long long et1 = dtime_usec(0);


for (int i = 0; i < num_gpus; i++) {
gaussian_pdf<<<(ds+255)/256, 256>>>(d_x[i], d_y[i], 0.0, 1.0, ds);
}
cudaDeviceSynchronize();
cudaCheckErrors("execution error");

et1 = dtime_usec(et1);
std::cout << "elapsed time: " << et1/(float)USECPSEC << std::endl;

return 0;
}
#include <math.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

// modifiable
typedef float ft;
const int chunks = 64;
const size_t ds = 1024*1024*chunks;
const int count = 22;
const int num_gpus = 4;

// not modifiable
const float sqrt_2PIf = 2.5066282747946493232942230134974f;
const double sqrt_2PI = 2.5066282747946493232942230134974;
__device__ float gpdf(float val, float sigma) {
return expf(-0.5f * val * val) / (sigma * sqrt_2PIf);
}

__device__ double gpdf(double val, double sigma) {


return exp(-0.5 * val * val) / (sigma * sqrt_2PI);
}
// compute average gaussian pdf value over a window around each point
__global__ void gaussian_pdf(const ft * __restrict__ x, ft * __restrict__ y,
const ft mean, const ft sigma, const int n) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < n) {
ft in = x[idx] - (count / 2) * 0.01f;
ft out = 0;
for (int i = 0; i < count; i++) {
ft temp = (in - mean) / sigma;
out += gpdf(temp, sigma);
in += 0.01f;
}
y[idx] = out / count;
}
}

// error check macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

// host-based timing
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start) {


timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main() {
ft *h_x, *d_x[num_gpus], *d_y[num_gpus];
h_x = (ft *)malloc(ds * sizeof(ft));

for (int i = 0; i < num_gpus; i++) {


cudaSetDevice(i);
cudaMalloc(&d_x[i], ds * sizeof(ft));
cudaMalloc(&d_y[i], ds * sizeof(ft));
}
cudaCheckErrors("allocation error");

for (int i = 0; i < num_gpus; i++) {


for (size_t j = 0; j < ds; j++) {
h_x[j] = rand() / (ft)RAND_MAX;
}
cudaSetDevice(i);
cudaMemcpy(d_x[i], h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);
}
cudaCheckErrors("copy error");

unsigned long long et1 = dtime_usec(0);

for (int i = 0; i < num_gpus; i++) {


cudaSetDevice(i);
gaussian_pdf<<<(ds+255)/256, 256>>>(d_x[i], d_y[i], 0.0, 1.0, ds);
}
cudaDeviceSynchronize();
cudaCheckErrors("execution error");

et1 = dtime_usec(et1);
std::cout << "elapsed time: " << et1/(float)USECPSEC << std::endl;
return 0;
}
#include <math.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

// modifiable
typedef float ft;
const int chunks = 64;
const size_t ds = 1024*1024*chunks;
const int count = 22;
const int num_streams = 8;

// not modifiable
const float sqrt_2PIf = 2.5066282747946493232942230134974f;
const double sqrt_2PI = 2.5066282747946493232942230134974;
__device__ float gpdf(float val, float sigma) {
return expf(-0.5f * val * val) / (sigma * sqrt_2PIf);
}

__device__ double gpdf(double val, double sigma) {


return exp(-0.5 * val * val) / (sigma * sqrt_2PI);
}

// compute average gaussian pdf value over a window around each point
__global__ void gaussian_pdf(const ft * __restrict__ x, ft * __restrict__ y,
const ft mean, const ft sigma, const int n) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < n) {
ft in = x[idx] - (count / 2) * 0.01f;
ft out = 0;
for (int i = 0; i < count; i++) {
ft temp = (in - mean) / sigma;
out += gpdf(temp, sigma);
in += 0.01f;
}
y[idx] = out / count;
}
}

// error check macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

// host-based timing
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start) {


timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main() {
ft *h_x, *d_x, *h_y, *h_y1, *d_y;
cudaHostAlloc(&h_x, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y1, ds*sizeof(ft), cudaHostAllocDefault);
cudaMalloc(&d_x, ds*sizeof(ft));
cudaMalloc(&d_y, ds*sizeof(ft));
cudaCheckErrors("allocation error");

cudaStream_t streams[num_streams];
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
}
cudaCheckErrors("stream creation error");

gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds); // warm-up

for (size_t i = 0; i < ds; i++) {


h_x[i] = rand() / (ft)RAND_MAX;
}
cudaDeviceSynchronize();

unsigned long long et1 = dtime_usec(0);

cudaMemcpy(d_x, h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);


gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds);
cudaMemcpy(h_y1, d_y, ds * sizeof(ft), cudaMemcpyDeviceToHost);
cudaCheckErrors("non-streams execution error");

et1 = dtime_usec(et1);
std::cout << "non-stream elapsed time: " << et1/(float)USECPSEC <<
std::endl;

#ifdef USE_STREAMS
cudaMemset(d_y, 0, ds * sizeof(ft));

unsigned long long et = dtime_usec(0);

for (int i = 0; i < chunks; i++) { //depth-first launch


cudaMemcpyAsync(d_x + FIXME, h_x + FIXME, (FIXME) * sizeof(ft),
cudaMemcpyHostToDevice, streams[FIXME]);
gaussian_pdf<<<((FIXME) + 255) / 256, 256, 0, streams[FIXME]>>>(d_x
+ FIXME, d_y + FIXME, 0.0, 1.0, FIXME);
cudaMemcpyAsync(h_y + i * (ds / chunks), d_y + i * (ds / chunks), (ds /
chunks) * sizeof(ft), cudaMemcpyDeviceToHost, streams[i %
num_streams]);
}
cudaDeviceSynchronize();
cudaCheckErrors("streams execution error");

et = dtime_usec(et);

for (int i = 0; i < ds; i++) {


if (h_y[i] != h_y1[i]) {
std::cout << "mismatch at " << i << " was: " << h_y[i] << " should be: "
<< h_y1[i] << std::endl;
return -1;
}
}

std::cout << "streams elapsed time: " << et/(float)USECPSEC <<


std::endl;
#endif

return 0;
}
This exercise, in 3 parts, is designed to walk you through an analysis-driven optimization sequence using Nsight Compute. The overall exercise is focused on optimizing square matrix transpose. This operation can be described simply as:

B(i,j) = A(j,i)

for input matrix A, output matrix B, and indices i and j varying over the square matrix side dimension. This algorithm involves no compute activity; it is therefore a memory-bound algorithm, and our final objective will be to come as close as possible to the available memory bandwidth of the GPU we are running on.
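As a point of reference for what "achieved bandwidth" means here: the task codes report it by counting one 8-byte read and one 8-byte write per matrix element and dividing by the measured time, as in the `fprintf` calls you will find in the listings below (CUDA events report elapsed time in milliseconds):

```cpp
/* GB/s as reported by the task codes: 2 accesses per element, 8 bytes each,
   over a size x size matrix, divided by the elapsed time in seconds */
double gbps = 8.0 * 2.0 * (double) size * (double) size /
              ( (double) elapsedTime / 1000.0 ) * 1.e-9;
```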

## **1. Naive Global-Memory Matrix Transpose**

For your first task, change into the *task1* directory. There you should edit
the *task1.cu* file to complete the matrix transpose operation. Most of the
code is written for you, but replace the **FIXME** entries with the proper
code to complete the matrix transpose using global memory. The formula
given above should guide your efforts. Here are some hints:

- Each thread reads from (row, col) and writes to (col, row)
- Use the indexing macro below, where `ld` is the leading dimension (width); a short illustration follows this list:

```cpp
#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )
```
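For example, with the macro exactly as written in the hint (row-major), the element at row 2, column 1 of a matrix with leading dimension 4 maps to linear index 2*4 + 1 = 9. Note, as an aside, that the task source files define `INDX` in column-major order, so follow whichever definition appears in the file you are editing. A tiny compile-time check of the hint's form:

```cpp
#define INDX( row, col, ld ) ( ( (row) * (ld) ) + (col) )   /* row-major form from the hint above */
static_assert( INDX(2, 1, 4) == 9, "row 2, col 1, ld 4 -> linear index 9" );
```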

If you need help, you can refer to the *task1_solution.cu* file. Then
compile and test your code:

```bash
module load cuda
./build_nvcc
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```bash
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./task1
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```bash
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g1'
lsfrun ./task1
```

To run your code at NERSC on Cori, we can use Slurm:

```bash
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --gres=gpu:1 -c 10 ./task1
```

Allocation `m3502` is a custom allocation set up on Cori for this training series, and should be available to participants who registered in advance. If you cannot submit using this allocation, but already have access to another allocation that grants access to the Cori GPU nodes (such as m1759), you may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```bash
salloc -C gpu -N 1 -t 60 -A m3502 --gres=gpu:1 -c 10
srun -n 1 ./task1
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

You should get a PASS result indication, along with a measurement of performance in terms of achieved bandwidth.

Once you have a PASS result, begin the first round of analysis by running the profiler:
```bash
module load nsight-compute
lsfrun nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./task1
```

Here's a breakdown of the metrics we are requesting from the profiler:

- *l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum*: The number of global load transactions
- *l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum*: The number of global load requests
- *l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio*: The number of global load transactions per request
- *smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct*: The global load efficiency
- *l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum*: The number of global store transactions
- *l1tex__t_requests_pipe_lsu_mem_global_op_st.sum*: The number of global store requests
- *l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio*: The number of global store transactions per request
- *smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct*: The global store efficiency

Using these metrics, we can easily observe various characteristics of our kernel. Many of these metrics are self-explanatory, but it may not be immediately obvious how global load and store *efficiency* is calculated. We can also calculate our global load and store efficiencies ourselves, by dividing the theoretical minimum number of transactions per request by the actual number of transactions per request we obtained from the above metrics.

How do we know what the theoretical minimum number of transactions per request actually is? A cache line is 128 bytes, and there are 32 threads in a warp. If the 32 threads are accessing consecutive 4-byte words (i.e. single precision floats), then there should be 4 transactions in that request (we are just asking for four consecutive 32-byte sectors of DRAM). In our case, we are using double precision floats, so the 32 threads would be accessing consecutive 8-byte words (256 bytes total). Therefore, the theoretical minimum number of transactions per request in our case would be 8 (eight consecutive 32-byte sectors of DRAM).
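As an illustrative calculation (using the double-precision numbers above, and comparing a fully coalesced access pattern against a fully columnar one over this 4096-wide matrix):

```
efficiency = (theoretical minimum sectors per request) / (actual sectors per request)

warp loads 32 consecutive doubles:           8 / 8  = 100%
warp loads a column (stride 4096 doubles):   8 / 32 =  25%   (each thread lands in a different 128-byte line)
```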

Considering the output of the profiler, are the Global Load Efficiency and
Global Store Efficiency both at 100%? Why or why not? This may be a
good time to study the load and store indexing carefully, and review the
global coalescing rules learned in Homework 4.

## **2. Fix Global Memory Coalescing Issue via Shared-Memory Tiled Transpose**

In task 1, we learned that the naive global memory transpose algorithm suffers from the fact that either the load or the store operation must be un-coalesced, i.e. columnar memory access. To fix this, we must come up with a procedure that allows both coalesced loads and coalesced stores to global memory. Therefore, we will move tiles from the input matrix into shared memory, and then write each tile out to the output matrix, to allow a transpose of the tile. This involves a read from global, write to shared, followed by a read from shared, write to global.

During these two steps, we will need to:

- perform an "*in-tile*" transpose, i.e. either read row-wise and write column-wise, or vice versa
- perform a "*tile-position*" transpose, meaning an input tile at tile indices *i,j* must be stored to tile indices *j,i* in the output matrix (see the sketch after this list)
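Here is a compact sketch of those two steps, using the tile and thread index names that appear in *task2_solution.cu* (the bounds checks against *m* are omitted here); it is meant only to show the shape of the answer:

```cpp
__shared__ double tile[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

/* coalesced read of one input tile: threadIdx.x varies fastest in the first INDX argument */
tile[threadIdx.x][threadIdx.y] = a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];

__syncthreads();

/* coalesced write to the mirrored tile position (tileX and tileY swapped), reading the
   shared tile with swapped thread indices to perform the in-tile transpose */
c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] = tile[threadIdx.y][threadIdx.x];
```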

Change from directory *task1* to *task2*. Edit the *task2.cu* file, wherever
the **FIXME** occurs, to achieve the above two operations. If you need
help, refer to the *task2_solution.cu* file. This is the hardest programming
assignment of the 3 tasks in this exercise.

As in task1, compile and run your code:

```bash
./build_nvcc
lsfrun ./task2
```

You should get a PASS output. Has the measured bandwidth improved?

Once again we will use the profiler to help explain our observations. We have introduced shared memory operations into our algorithm, so we will include shared memory metrics in our profiling:

```bash
lsfrun nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct ./task2
```

- *l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum*: The number of shared load transactions
- *l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum*: The number of shared store transactions
- *l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum*: The number of shared load bank conflicts
- *l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum*: The number of shared store bank conflicts
- *smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct*: Shared memory efficiency

You should be able to confirm that the previous global load/global store efficiency issues have been resolved, with proper coalescing. However, we now have a problem with shared memory: bank conflicts. Review the module 4 information on bank conflicts for a basic definition of how these arise during shared memory access.

## **3. Fixing shared memory bank conflicts**

Our strategy to fix shared memory bank conflicts in this case is fairly
simple. We will leave the shared memory indexing unchanged from
exercise 2, but we will add a column to the shared memory definition in our
code. This will allow both row-wise and columnar access to shared memory
(needed for our in-tile transpose step) without bank conflicts.
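Relative to the task 2 kernel, the only code change needed is a single column of padding in the shared memory declaration (this is what *task3_solution.cu* does):

```cpp
/* the extra column shifts successive rows of the tile across banks, so both row-wise
   and column-wise accesses to the tile are free of bank conflicts */
__shared__ double smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y + 1];
```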

Change to the *task3* directory.

Modify the *task3.cu* code as needed. If you need help, refer to *task3_solution.cu*. Compile and run your code in a similar fashion to the previous 2 tasks.

You should get a passing result. Has the achieved bandwidth improved?

You can profile your code to confirm that we are now using shared memory
in an efficient fashion, for both loads and stores.

```bash
lsfrun nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct ./task3
```

Finally, if you wish, compare the achieved bandwidth reported by your code to a proxy measurement of the peak achievable bandwidth, by running the bandwidthTest CUDA sample code and using the device-to-device memory number for comparison.

nvcc -O3 -o task1 task1.cu

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum ./task1
/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>

#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for naive matrix transpose */

__global__ void naive_cuda_transpose( const int m,


const double * const a,
double * const c )
{
const int myRow = FIXME
const int myCol = FIXME

if( myRow < m && myCol < m )


{
c[FIXME] = a[FIXME];
} /* end if */
return;

} /* end naive_cuda_transpose */

void host_transpose( const int m, const double * const a, double *c )


{

/*
* naive matrix transpose goes here.
*/
for( int j = 0; j < m; j++ )
{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_dgemm */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;

size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */

h_a = (double *) malloc( numbytes );


if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
}

/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */

dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,


1 );
dim3 blocks( FIXME, FIXME, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive GPU transpose kernel */

naive_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK()
CUDA_CALL( cudaDeviceSynchronize() );
/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */

for( int j = 0; j < size; j++ )


{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
} /* end fi */
} /* end for i */
} /* end for j */
/* free the memory */
printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );
CUDA_CALL( cudaDeviceReset() );

return 0;
} /* end main */

/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>

#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for naive matrix transpose */

__global__ void naive_cuda_transpose( const int m,


const double * const a,
double * const c )
{
const int myRow = blockDim.x * blockIdx.x + threadIdx.x;
const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

if( myRow < m && myCol < m )


{
c[INDX( myRow, myCol, m )] = a[INDX( myCol, myRow, m )];
} /* end if */
return;

} /* end naive_cuda_transpose */

void host_transpose( const int m, const double * const a, double *c )


{

/*
* naive matrix transpose goes here.
*/

for( int j = 0; j < m; j++ )


{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_dgemm */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;
size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */

h_a = (double *) malloc( numbytes );


if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
}

/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */


dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,
1 );
dim3 blocks( ( size / THREADS_PER_BLOCK_X ) + 1,
( size / THREADS_PER_BLOCK_Y ) + 1, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive GPU transpose kernel */

naive_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK()
CUDA_CALL( cudaDeviceSynchronize() );

/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */


for( int j = 0; j < size; j++ )
{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
} /* end fi */
} /* end for i */
} /* end for j */

/* free the memory */


printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );
CUDA_CALL( cudaDeviceReset() );

return 0;
} /* end main */

nvcc -O3 -o task2 task2.cu

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed ./task2

/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>
#include <math.h>

#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */

__global__ void smem_cuda_transpose( const int m,


double const * const a,
double * const c )
{

/* declare a shared memory array */

__shared__ double smemArray[FIXME][FIXME];



/* determine my row and column indices for the error checking code */

const int myRow = blockDim.x * blockIdx.x + threadIdx.x;


const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

/* determine my row tile and column tile index */


const int tileX = FIXME
const int tileY = FIXME

if( myRow < m && myCol < m )


{
/* read to the shared mem array */
/* HINT: threadIdx.x should appear somewhere in the first argument to */
/* your INDX calculation for both a[] and c[]. This will ensure proper */
/* coalescing. */

smemArray[FIXME][FIXME] =
a[FIXME];
} /* end if */

/* synchronize */
__syncthreads();
​ ​
if( myRow < m && myCol < m )
{
/* write the result */
c[FIXME] =
smemArray[FIXME][FIXME];
} /* end if */
return;

} /* end smem_cuda_transpose */

void host_transpose( const int m, double const * const a, double * const c )


{

/*
* naive matrix transpose goes here.
*/

for( int j = 0; j < m; j++ )


{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_dgemm */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;

size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */

h_a = (double *) malloc( numbytes );


if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
} /* end for */

/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );
/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */

dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,


1 );
dim3 blocks( ( size / THREADS_PER_BLOCK_X ) + 1,
( size / THREADS_PER_BLOCK_Y ) + 1, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call smem GPU transpose kernel */

smem_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK();
CUDA_CALL( cudaDeviceSynchronize() );
/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */

for( int j = 0; j < size; j++ )


{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
}
} /* end for i */
} /* end for j */
/* free the memory */
printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );

CUDA_CALL( cudaDeviceReset() );

return 0;
}

/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>
#include <math.h>
#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */

__global__ void smem_cuda_transpose( const int m,


double const * const a,
double * const c )
{

/* declare a shared memory array */
__shared__ double
smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

/* determine my row and column indices for the error checking code */

const int myRow = blockDim.x * blockIdx.x + threadIdx.x;


const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

/* determine my row tile and column tile index */

const int tileX = blockDim.x * blockIdx.x;


const int tileY = blockDim.y * blockIdx.y;

if( myRow < m && myCol < m )


{
/* read to the shared mem array */
/* HINT: threadIdx.x should appear somewhere in the first argument to */
/* your INDX calculation for both a[] and c[]. This will ensure proper */
/* coalescing. */

smemArray[threadIdx.x][threadIdx.y] =
a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];
} /* end if */

/* synchronize */
__syncthreads();
​ ​
if( myRow < m && myCol < m )
{
/* write the result */
c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] =
smemArray[threadIdx.y][threadIdx.x];
} /* end if */
return;
} /* end smem_cuda_transpose */

void host_transpose( const int m, double const * const a, double * const c )


{

/*
* naive matrix transpose goes here.
*/

for( int j = 0; j < m; j++ )


{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_dgemm */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;

size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */


h_a = (double *) malloc( numbytes );
if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
} /* end for */
/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */

dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,


1 );
dim3 blocks( ( size / THREADS_PER_BLOCK_X ) + 1,
( size / THREADS_PER_BLOCK_Y ) + 1, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call smem GPU transpose kernel */

smem_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK();
CUDA_CALL( cudaDeviceSynchronize() );

/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */

for( int j = 0; j < size; j++ )


{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
}
} /* end for i */
} /* end for j */

/* free the memory */


printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );

CUDA_CALL( cudaDeviceReset() );

return 0;
}

nvcc -O3 -o task3 task3.cu

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second,l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second ./task3

/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>
#include <math.h>

#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */

__global__ void smem_cuda_transpose( const int m,


double const * const a,
double * const c )
{

/* declare a shared memory array */

__shared__ double
smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y];

/* determine my row and column indices for the error checking code */

const int myRow = blockDim.x * blockIdx.x + threadIdx.x;


const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

/* determine my row tile and column tile index */


const int tileX = blockDim.x * blockIdx.x;
const int tileY = blockDim.y * blockIdx.y;

if( myRow < m && myCol < m )


{
/* read to the shared mem array */
/* HINT: threadIdx.x should appear somewhere in the first argument to */
/* your INDX calculation for both a[] and c[]. This will ensure proper */
/* coalescing. */

smemArray[threadIdx.x][threadIdx.y] =
a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];
} /* end if */

/* synchronize */
__syncthreads();
​ ​
if( myRow < m && myCol < m )
{
/* write the result */
c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] =
smemArray[threadIdx.y][threadIdx.x];
} /* end if */
return;

} /* end smem_cuda_transpose */

void host_transpose( const int m, double const * const a, double * const c )


{

/*
* naive matrix transpose goes here.
*/
for( int j = 0; j < m; j++ )
{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_dgemm */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;

size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */

h_a = (double *) malloc( numbytes );


if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
} /* end for */

/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */

dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,


1 );
dim3 blocks( ( size / THREADS_PER_BLOCK_X ) + 1,
( size / THREADS_PER_BLOCK_Y ) + 1, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call smem GPU transpose kernel */

smem_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK();
CUDA_CALL( cudaDeviceSynchronize() );
/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */

for( int j = 0; j < size; j++ )


{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
}
} /* end for i */
} /* end for j */

/* free the memory */


printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );

CUDA_CALL( cudaDeviceReset() );

return 0;
}

/*
* Copyright 2014 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <stdio.h>
#include <math.h>
#ifdef DEBUG
#define CUDA_CALL(F) if( (F) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__); exit(-1);}
#define CUDA_CHECK() if( (cudaPeekAtLastError()) != cudaSuccess ) \
{printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
__FILE__,__LINE__-1); exit(-1);}
#else
#define CUDA_CALL(F) (F)
#define CUDA_CHECK()
#endif

/* definitions of threadblock size in X and Y directions */

#define THREADS_PER_BLOCK_X 32
#define THREADS_PER_BLOCK_Y 32

/* definition of matrix linear dimension */

#define SIZE 4096

/* macro to index a 1D memory array with 2D indices in column-major order


*/

#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )

/* CUDA kernel for shared memory matrix transpose */

__global__ void smem_cuda_transpose( const int m,


double const * const a,
double * const c )
{

/* declare a shared memory array */
__shared__ double
smemArray[THREADS_PER_BLOCK_X][THREADS_PER_BLOCK_Y+1];

/* determine my row and column indices for the error checking code */

const int myRow = blockDim.x * blockIdx.x + threadIdx.x;


const int myCol = blockDim.y * blockIdx.y + threadIdx.y;

/* determine my row tile and column tile index */

const int tileX = blockDim.x * blockIdx.x;


const int tileY = blockDim.y * blockIdx.y;

if( myRow < m && myCol < m )


{
/* read to the shared mem array */
/* HINT: threadIdx.x should appear somewhere in the first argument to */
/* your INDX calculation for both a[] and c[]. This will ensure proper */
/* coalescing. */

smemArray[threadIdx.x][threadIdx.y] =
a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];
} /* end if */

/* synchronize */
__syncthreads();

if( myRow < m && myCol < m )
{
/* write the result */
c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] =
smemArray[threadIdx.y][threadIdx.x];
} /* end if */
return;
} /* end smem_cuda_transpose */

void host_transpose( const int m, double const * const a, double * const c )


{

/*
* naive matrix transpose goes here.
*/

for( int j = 0; j < m; j++ )


{
for( int i = 0; i < m; i++ )
{
c[INDX(i,j,m)] = a[INDX(j,i,m)];
} /* end for i */
} /* end for j */

} /* end host_transpose */

int main( int argc, char *argv[] )


{

int size = SIZE;

fprintf(stdout, "Matrix size is %d\n",size);

/* declaring pointers for array */

double *h_a, *h_c;


double *d_a, *d_c;

size_t numbytes = (size_t) size * (size_t) size * sizeof( double );

/* allocating host memory */


h_a = (double *) malloc( numbytes );
if( h_a == NULL )
{
fprintf(stderr,"Error in host malloc h_a\n");
return 911;
}

h_c = (double *) malloc( numbytes );


if( h_c == NULL )
{
fprintf(stderr,"Error in host malloc h_c\n");
return 911;
}

/* allocating device memory */

CUDA_CALL( cudaMalloc( (void**) &d_a, numbytes ) );


CUDA_CALL( cudaMalloc( (void**) &d_c, numbytes ) );

/* set result matrices to zero */

memset( h_c, 0, numbytes );


CUDA_CALL( cudaMemset( d_c, 0, numbytes ) );

fprintf( stdout, "Total memory required per matrix is %lf MB\n",


(double) numbytes / 1000000.0 );

/* initialize input matrix with random value */

for( int i = 0; i < size * size; i++ )


{
h_a[i] = double( rand() ) / ( double(RAND_MAX) + 1.0 );
} /* end for */
/* copy input matrix from host to device */

CUDA_CALL( cudaMemcpy( d_a, h_a, numbytes,


cudaMemcpyHostToDevice ) );

/* create and start timer */

cudaEvent_t start, stop;


CUDA_CALL( cudaEventCreate( &start ) );
CUDA_CALL( cudaEventCreate( &stop ) );
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call naive cpu transpose function */

host_transpose( size, h_a, h_c );

/* stop CPU timer */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
float elapsedTime;
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print CPU timing information */

fprintf(stdout, "Total time CPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* setup threadblock size and grid sizes */

dim3 threads( THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y,


1 );
dim3 blocks( ( size / THREADS_PER_BLOCK_X ) + 1,
( size / THREADS_PER_BLOCK_Y ) + 1, 1 );

/* start timers */
CUDA_CALL( cudaEventRecord( start, 0 ) );

/* call smem GPU transpose kernel */

smem_cuda_transpose<<< blocks, threads >>>( size, d_a, d_c );


CUDA_CHECK();
CUDA_CALL( cudaDeviceSynchronize() );

/* stop the timers */

CUDA_CALL( cudaEventRecord( stop, 0 ) );


CUDA_CALL( cudaEventSynchronize( stop ) );
CUDA_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

/* print GPU timing information */

fprintf(stdout, "Total time GPU is %f sec\n", elapsedTime / 1000.0f );


fprintf(stdout, "Performance is %f GB/s\n",
8.0 * 2.0 * (double) size * (double) size /
( (double) elapsedTime / 1000.0 ) * 1.e-9 );

/* copy data from device to host */

CUDA_CALL( cudaMemset( d_a, 0, numbytes ) );


CUDA_CALL( cudaMemcpy( h_a, d_c, numbytes,
cudaMemcpyDeviceToHost ) );

/* compare GPU to CPU for correctness */

for( int j = 0; j < size; j++ )


{
for( int i = 0; i < size; i++ )
{
if( h_c[INDX(i,j,size)] != h_a[INDX(i,j,size)] )
{
printf("Error in element %d,%d\n", i,j );
printf("Host %f, device %f\n",h_c[INDX(i,j,size)],
h_a[INDX(i,j,size)]);
printf("FAIL\n");
goto end;
}
} /* end for i */
} /* end for j */

/* free the memory */


printf("PASS\n");

end:
free( h_a );
free( h_c );
CUDA_CALL( cudaFree( d_a ) );
CUDA_CALL( cudaFree( d_c ) );

CUDA_CALL( cudaDeviceReset() );

return 0;
}

# **1. Exploring Threadblock-Level Groups**

## **1a. Creating Groups**

First, take the *task1.cu* code and complete the sections indicated by
**FIXME** to create a proper thread-block group, and assign that group to
the group being used for printout purposes. You should only need to modify
the 2 lines containing **FIXME** for this first step; a minimal sketch of the
relevant API is shown below.
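
For orientation, here is a minimal, self-contained sketch of creating and using a thread-block group. This is not the task1.cu kernel; the kernel name, launch configuration, and printout are illustrative, and it builds with the same `nvcc -arch=sm_70 -std=c++11` flags used for the task:

```cpp
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void group_demo() {
  // Handle to the group of all threads in this thread block
  cg::thread_block block = cg::this_thread_block();
  if (block.thread_rank() == 0)
    printf("block %u has %u threads\n", blockIdx.x, block.size());
  block.sync();  // equivalent to __syncthreads() for the whole block
}

int main() {
  group_demo<<<2, 64>>>();
  cudaDeviceSynchronize();
  return 0;
}
```
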
You can compile your code as follows:

```bash
module load cuda
nvcc -arch=sm_70 -o task1 task1.cu -std=c++11
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++. Note that because we're using C++11 (which is required
for cooperative groups) we need a sufficiently modern compiler (gcc >= 5
should be sufficient). If you're on Summit, make sure to do `module load
gcc` because the system default gcc is not recent enough.

To run your code, we will use an LSF command:

```bash
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./task1
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```bash
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g1'
lsfrun ./task1
```

To run your code at NERSC on Cori, we can use Slurm:

```bash
module load esslurm
srun -C gpu -N 1 -n 1 -t 10 -A m3502 --gres=gpu:1 -c 10 ./task1
```

Allocation `m3502` is a custom allocation set up on Cori for this training
series, and should be available to participants who registered in advance. If
you cannot submit using this allocation, but already have access to another
allocation that grants access to the Cori GPU nodes (such as m1759), you
may use that instead.

If you prefer, you can instead reserve a GPU in an interactive session, and
then run an executable any number of times while the Slurm allocation is
active (this is recommended if there are enough available nodes):

```bash
salloc -C gpu -N 1 -t 60 -A m3502 --gres=gpu:1 -c 10
srun -n 1 ./task1
```

Note that you only need to `module load esslurm` once per login session;
this is what enables you to submit to the Cori GPU nodes.

Correct output should look like this:

```bash
group partial sum: 256
```

If you need help, refer to the *task1_solution1.cu* file (which contains the
solution for tasks 1a, 1b, and 1c).

## **1b. Partitioning Groups**

Next, uncomment the line that starts with the `auto` keyword and complete it
so that it takes the previously created thread-block group and subdivides it
into a set of 32-thread partitions, using the dynamic (runtime) partitioning
method; a minimal sketch of the call is shown below.
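
For reference, dynamic partitioning of an existing group looks like the following sketch. It is not the task1.cu code; the kernel name and `parent` stand in for whichever group you are subdividing:

```cpp
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void tile_demo() {
  cg::thread_block parent = cg::this_thread_block();
  // Dynamic (runtime) partition of the parent group into 32-thread tiles
  cg::thread_group tile = cg::tiled_partition(parent, 32);
  if (tile.thread_rank() == 0)
    printf("tile of %u threads\n", tile.size());
}

int main() {
  tile_demo<<<1, 256>>>();
  cudaDeviceSynchronize();
  return 0;
}
```
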
Compile and run the code as above. Correct output should look like:

```bash
group partial sum: 32
group partial sum: 32
group partial sum: 32
group partial sum: 32
group partial sum: 32
group partial sum: 32
group partial sum: 32
group partial sum: 32
```

## **1c. Third Group Creation/Decomposition**

Now perform the 3rd group creation/decomposition.

Compile and run the code as above. Correct output should look like:

```bash
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
group partial sum: 16
```

# **2. Exploring Grid-Wide Sync**

One of the motivations suggested for a grid-wide sync is to combine
algorithm phases that must be completed in sequence and would normally be
realized as separate CUDA kernel calls; in that case, the kernel launch
boundary provides an implicit/effective grid-wide sync. Cooperative groups,
however, make it possible to perform a grid-wide sync directly in kernel
code, rather than at a kernel launch boundary. The device-side pattern looks
like the sketch below.
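
A minimal sketch of the device-side pattern, assuming the kernel is built with `-rdc=true` and launched cooperatively (via `cudaLaunchCooperativeKernel`, shown later in this exercise). The kernel name and the two phase operations are illustrative, not the task2 code:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase(float *data, int n) {
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;        // phase 1: per-element work
  grid.sync();                       // grid-wide barrier, replaces a kernel launch boundary
  if (i < n) data[i] += data[0];     // phase 2: may safely read results of phase 1
}
```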

One such algorithm would be stream compaction. Stream compaction is
used in many places, and fundamentally seeks to reduce the length of a
data stream using a particular removal heuristic or predicate test. For
example, if we had the following data stream:

```bash
3 4 3 7 0 5 0 8 0 0 0 4
```

we could do stream compaction by removing the zeroes, ending up with:

```bash
3 4 3 7 5 8 4
```

Like many reduction-type algorithms (the output here is potentially much
smaller than the input), this is easy to imagine doing serially, but a fast
parallel stream compaction requires some additional thought. A common
approach is to use a prefix sum. A prefix sum is a data set in which each
item is the sum of the input elements from the beginning of the input up to
that point. We can use a prefix sum to help parallelize our stream
compaction. We start by creating an array of ones and zeroes, with a one
corresponding to each element we want to keep and a zero for each element we
want to discard:

```bash
3 4 3 7 0 5 0 8 0 0 0 4 (input data)
1 1 1 1 0 1 0 1 0 0 0 1 (filtering of input)
```

We then do an exclusive prefix sum on that filtered array (exclusive means
that only the elements "to the left" are included in the sum; the element at
that position itself is excluded):

```bash
3 4 3 7 0 5 0 8 0 0 0 4 (input data)
1 1 1 1 0 1 0 1 0 0 0 1 (filtering of input)
0 1 2 3 4 4 5 5 6 6 6 6 (exclusive prefix sum of filtered data)
```
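
To make the indexing concrete, here is a serial, host-only sketch of the same idea (illustrative only; the exercise performs these steps in parallel on the GPU):

```cpp
#include <cstdio>

int main() {
  const int n = 12;
  int in[n]  = {3,4,3,7,0,5,0,8,0,0,0,4};
  int flag[n], scan[n], out[n];
  for (int i = 0; i < n; i++) flag[i] = (in[i] != 0);       // predicate: keep non-zeroes
  scan[0] = 0;                                              // exclusive prefix sum of flags
  for (int i = 1; i < n; i++) scan[i] = scan[i-1] + flag[i-1];
  int m = 0;
  for (int i = 0; i < n; i++)
    if (flag[i]) { out[scan[i]] = in[i]; m = scan[i] + 1; }  // scatter kept elements
  for (int i = 0; i < m; i++) printf("%d ", out[i]);         // prints: 3 4 3 7 5 8 4
  printf("\n");
  return 0;
}
```

Each kept element lands at `out[scan[i]]`; that is exactly the role the prefix sum plays in the parallel version as well.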

This prefix sum now contains the index in the output array to which each
input position should be copied. We only copy a position from input to
output if the corresponding filter element is not zero. This demonstrates
how to use a prefix sum to assist with a stream compaction, but it doesn't
identify how to do the prefix sum itself in parallel, efficiently. A full
treatment is beyond the scope of this document, but you can refer to this
chapter for a good treatment:
https://people.eecs.berkeley.edu/~driscoll/cs267/papers/gpugems3_ch39.html

Some key takeaways are that a prefix sum has a sweeping operation, not
unlike the sweeping operation performed in a parallel reduction, but with
key differences: the sweep runs from "left" to "right" in the prefix sum
whereas it usually runs from right to left in a typical parallel reduction,
and the break points (i.e., the division of threads participating at each
sweep phase) are different.

When parallelizing a prefix sum, we often require multiple phases, for
example a thread-block-level scan (prefix-sum) operation, followed by
another operation to "fix up" the thread-block-level results based on the
data from other ("previous") thread blocks. These phases may require a
grid-wide sync, and a typical scan from a library such as thrust will use
multiple kernel calls. Let's see if we can do it in a single kernel call.
You won't have to write any scan code, other than inserting appropriate
cooperative group sync points. We need sync points at the threadblock level
(based on the threadblock-level group created for you) and also at the grid
level.

Start with the *task2.cu* code, and perform 2 things:

- Modify the **FIXME** statements in the kernel to insert appropriate sync
operations as requested, based on the two group types created at the top
of the kernel. Only one grid-wide sync point is needed; the others are all
thread-block-level sync points.
- In the host code, modify the **FIXME** statements to do a proper
cooperative launch. The launch function is already provided; you just need
to fill in the remaining 4 arguments (the general shape of the call is
sketched after this list). Refer to the *task2_solution.cu* file for help,
or to the CUDA runtime API documentation for the launch function:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73
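
A minimal, self-contained sketch of the launch call's shape, built with `-rdc=true -std=c++11`. The kernel `demo_kernel`, the buffer, and the hard-coded grid size are illustrative assumptions; task2 instead derives its grid size from the occupancy API:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void demo_kernel(float *data, int n) {
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
  grid.sync();  // legal only because the kernel is launched cooperatively
}

int main() {
  int n = 160 * 256;  // one thread per element for this illustrative grid
  float *d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  dim3 block(256);
  dim3 grid(160);  // must not exceed what can be simultaneously resident on the GPU
  void *args[] = {(void *)&d_data, (void *)&n};
  // Arguments after the kernel pointer: grid dims, block dims, array of pointers
  // to the kernel arguments, dynamic shared memory bytes, and the stream.
  cudaLaunchCooperativeKernel((void *)demo_kernel, grid, block, args, 0, 0);
  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```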

Once you have made the above modifications, compile your code as follows:

```bash
nvcc -arch=sm_70 -o task2 task2.cu -rdc=true -std=c++11
```

and run it as follows:

```bash
lsfrun ./task2
```

Correct output should be simply:

```bash
number of SMs = 80
number of blocks per SM = 8
kernel time: 0.043872ms
thrust time: 0.083200ms
```

(The above is representative; if you run it on a GPU other than a Tesla
V100, you may see different numbers, but you should still see only the above
4 lines.)

The above informational data contains "*occupancy*" information produced
by the occupancy API. Note that we are able to put 8 of these 256-thread
threadblocks on a single SM, for the full theoretical maximum of 2048
threads per SM. This is 100% occupancy (the kernel is fairly simple and low
in resource utilization/requirements). A minimal sketch of the occupancy
query itself is shown below.
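
The query used in the provided task2.cu boils down to the following self-contained sketch; the empty `my_kernel` and the 256-thread block size are illustrative assumptions:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() {}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int blocks_per_sm = 0;
  // Maximum number of resident blocks of my_kernel per SM, for 256 threads
  // per block and no dynamic shared memory.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, 256, 0);
  printf("blocks/SM = %d, occupancy = %.0f%%\n", blocks_per_sm,
         100.0 * blocks_per_sm * 256 / prop.maxThreadsPerMultiProcessor);
  return 0;
}
```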

The code has silent validation built in, so no actual results are printed, other
than the above informational data. If you got a "*mismatch*" message,
something is wrong with your implementation.

Optional:

This task2 code compares the operation to an equivalent operation in
thrust. For this trivially small data set size, our monolithic kernel seems to
be faster than thrust. Run this small data set size using the nsight-compute
profiler to confirm for yourself that thrust is actually doing 2 kernel calls to
solve this problem:

```bash
module load nsight-compute
lsfrun nv-nsight-cu-cli ./task2
```

Now make the data set larger. A reasonable upper limit might be 32M
elements. Make sure to choose a number that is divisible by 256, the
threadblock size. For example, change:

```cpp
const int test_dsize = 256;
```

to something like:

```cpp
const int test_dsize = 1048576*16;
```

and recompile and rerun the code. Now which is faster, thrust or our naive
code?

Takeaway: don't write your own code if you can find a high-quality library
implementation. This is especially true for more complex algorithms like
sorting, prefix sums, and matrix multiply.

#include <cooperative_groups.h>
#include <stdio.h>
using namespace cooperative_groups;
const int nTPB = 256;
__device__ int reduce(thread_group g, int *x, int val) {
int lane = g.thread_rank();
for (int i = g.size()/2; i > 0; i /= 2) {
x[lane] = val; g.sync();
if (lane < i) val += x[lane + i]; g.sync();
}
if (g.thread_rank() == 0) printf("group partial sum: %d\n", val);
return val;
}

__global__ void my_reduce_kernel(int *data){

__shared__ int sdata[nTPB];


// task 1a: create a proper thread block group below
auto g1 = FIXME
size_t gindex = g1.group_index().x * nTPB + g1.thread_index().x;
// task 1b: uncomment and create a proper 32-thread tile below, using group g1 created above
// auto g2 = FIXME
// task 1c: uncomment and create a proper 16-thread tile below, using group g2 created above
// auto g3 = FIXME
// for each task, adjust the group to point to the last group created above
auto g = FIXME
// Make sure we send in the appropriate patch of shared memory
int sdata_offset = (g1.thread_index().x / g.size()) * g.size();
reduce(g, sdata + sdata_offset, data[gindex]);
}

int main(){

int *data;
cudaMallocManaged(&data, nTPB*sizeof(data[0]));
for (int i = 0; i < nTPB; i++) data[i] = 1;
my_reduce_kernel<<<1,nTPB>>>(data);
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("cuda error: %s\n", cudaGetErrorString(err));
}

#include <cooperative_groups.h>
#include <stdio.h>
using namespace cooperative_groups;
const int nTPB = 256;
__device__ int reduce(thread_group g, int *x, int val) {
int lane = g.thread_rank();
for (int i = g.size()/2; i > 0; i /= 2) {
x[lane] = val; g.sync();
if (lane < i) val += x[lane + i]; g.sync();
}
if (g.thread_rank() == 0) printf("group partial sum: %d\n", val);
return val;
}

__global__ void my_reduce_kernel(int *data){

__shared__ int sdata[nTPB];


// task 1a: create a proper thread block group below
auto g1 = this_thread_block();
size_t gindex = g1.group_index().x * nTPB + g1.thread_index().x;
// task 1b: uncomment and create a proper 32-thread tile below, using group g1 created above
auto g2 = tiled_partition(g1, 32);
// task 1c: uncomment and create a proper 16-thread tile below, using group g2 created above
auto g3 = tiled_partition(g2, 16);
// for each task, adjust the group to point to the last group created above
auto g = g3;
// Make sure we send in the appropriate patch of shared memory
int sdata_offset = (g1.thread_index().x / g.size()) * g.size();
reduce(g, sdata + sdata_offset, data[gindex]);
}

int main(){

int *data;
cudaMallocManaged(&data, nTPB*sizeof(data[0]));
for (int i = 0; i < nTPB; i++) data[i] = 1;
my_reduce_kernel<<<1,nTPB>>>(data);
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("cuda error: %s\n", cudaGetErrorString(err));
}

#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <cooperative_groups.h>

typedef int mytype;


const int test_dsize = 256;

const int nTPB = 256;

template <typename T>


__device__ unsigned predicate_test(T data, T testval){
if (data == testval) return 0;
return 1;
}

using namespace cooperative_groups;

// assume dsize is divisible by nTPB


template <typename T>
__global__ void my_remove_if(const T * __restrict__ idata, const T
remove_val, T * __restrict__ odata, unsigned * __restrict__ idxs, const
unsigned dsize){

__shared__ unsigned sidxs[nTPB];


auto g = this_thread_block();
auto gg = this_grid();
unsigned tidx = g.thread_rank();
unsigned gidx = tidx + nTPB*g.group_index().x;
unsigned gridSize = g.size()*gridDim.x;
// first use grid-stride loop to have each block do a prefix sum over the data set
for (unsigned i = gidx; i < dsize; i+=gridSize){
unsigned temp = predicate_test(idata[i], remove_val);
sidxs[tidx] = temp;
for (int j = 1; j < g.size(); j<<=1){
FIXME
if (j <= tidx){ temp += sidxs[tidx-j];}
FIXME
if (j <= tidx){ sidxs[tidx] = temp;}}
idxs[i] = temp;
FIXME}
// grid-wide barrier
FIXME
// then compute final index, and move input data to output location
unsigned stride = 0;
for (unsigned i = gidx; i < dsize; i+=gridSize){
T temp = idata[i];
if (predicate_test(temp, remove_val)){
unsigned my_idx = idxs[i];
for (unsigned j = 1; (j-1) < (g.group_index().x+(stride*gridDim.x)); j++)
my_idx += idxs[j*nTPB-1];
odata[my_idx-1] = temp;}
stride++;}
}

int main(){
// data setup
mytype *d_idata, *d_odata, *h_data;
unsigned *d_idxs;
size_t tsize = ((size_t)test_dsize)*sizeof(mytype);
h_data = (mytype *)malloc(tsize);
cudaMalloc(&d_idata, tsize);
cudaMalloc(&d_odata, tsize);
cudaMemset(d_odata, 0, tsize);
cudaMalloc(&d_idxs, test_dsize*sizeof(unsigned));
// check for support and device configuration
// and calculate maximum grid size
cudaDeviceProp prop;
cudaError_t err = cudaGetDeviceProperties(&prop, 0);
if (err != cudaSuccess) {printf("cuda error: %s\n",
cudaGetErrorString(err)); return 0;}
if (prop.cooperativeLaunch == 0) {printf("cooperative launch not supported\n"); return 0;}
int numSM = prop.multiProcessorCount;
printf("number of SMs = %d\n", numSM);
int numBlkPerSM;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlkPerSM,
my_remove_if<mytype>, nTPB, 0);
printf("number of blocks per SM = %d\n", numBlkPerSM);
// test 1: no remove values
for (int i = 0; i < test_dsize; i++) h_data[i] = i;
cudaMemcpy(d_idata, h_data, tsize, cudaMemcpyHostToDevice);
cudaStream_t str;
cudaStreamCreate(&str);
mytype remove_val = -1;
unsigned ds = test_dsize;
void *args[] = {(void *)&d_idata, (void *)&remove_val, (void *)&d_odata,
(void *)&d_idxs, (void *)&ds};
dim3 grid(numBlkPerSM*numSM);
dim3 block(nTPB);
cudaLaunchCooperativeKernel((void *)my_remove_if<mytype>, FIXME);
err = cudaMemcpy(h_data, d_odata, tsize, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {printf("cuda error: %s\n",
cudaGetErrorString(err)); return 0;}
//validate
for (int i = 0; i < test_dsize; i++) if (h_data[i] != i){printf("mismatch 1 at %d, was: %d, should be: %d\n", i, h_data[i], i); return 1;}
// test 2: with remove values
int val = 0;
for (int i = 0; i < test_dsize; i++){
if ((rand()/(float)RAND_MAX) > 0.5) h_data[i] = val++;
else h_data[i] = -1;}
thrust::device_vector<mytype> t_data(h_data, h_data+test_dsize);
cudaMemcpy(d_idata, h_data, tsize, cudaMemcpyHostToDevice);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
cudaLaunchCooperativeKernel((void *)my_remove_if<mytype>, FIXME);
cudaEventRecord(stop);
float et;
cudaMemcpy(h_data, d_odata, tsize, cudaMemcpyDeviceToHost);
cudaEventElapsedTime(&et, start, stop);
//validate
for (int i = 0; i < val; i++) if (h_data[i] != i){printf("mismatch 2 at %d, was: %d, should be: %d\n", i, h_data[i], i); return 1;}
printf("kernel time: %fms\n", et);
cudaEventRecord(start);
thrust::remove(t_data.begin(), t_data.end(), -1);
cudaEventRecord(stop);
thrust::host_vector<mytype> th_data = t_data;
// validate
for (int i = 0; i < val; i++) if (h_data[i] != th_data[i]){printf("mismatch 3 at %d, was: %d, should be: %d\n", i, th_data[i], h_data[i]); return 1;}
cudaEventElapsedTime(&et, start, stop);
printf("thrust time: %fms\n", et);
return 0;
}

#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <cooperative_groups.h>

typedef int mytype;


const int test_dsize = 256;

const int nTPB = 256;

template <typename T>


__device__ unsigned predicate_test(T data, T testval){
if (data == testval) return 0;
return 1;
}

using namespace cooperative_groups;

// assume dsize is divisible by nTPB


template <typename T>
__global__ void my_remove_if(const T * __restrict__ idata, const T
remove_val, T * __restrict__ odata, unsigned * __restrict__ idxs, const
unsigned dsize){

__shared__ unsigned sidxs[nTPB];


auto g = this_thread_block();
auto gg = this_grid();
unsigned tidx = g.thread_rank();
unsigned gidx = tidx + nTPB*g.group_index().x;
unsigned gridSize = g.size()*gridDim.x;
// first use grid-stride loop to have each block do a prefix sum over the data set
for (unsigned i = gidx; i < dsize; i+=gridSize){
unsigned temp = predicate_test(idata[i], remove_val);
sidxs[tidx] = temp;
for (int j = 1; j < g.size(); j<<=1){
g.sync();
if (j <= tidx){ temp += sidxs[tidx-j];}
g.sync();
if (j <= tidx){ sidxs[tidx] = temp;}}
idxs[i] = temp;
g.sync();}
// grid-wide barrier
gg.sync();
// then compute final index, and move input data to output location
unsigned stride = 0;
for (unsigned i = gidx; i < dsize; i+=gridSize){
T temp = idata[i];
if (predicate_test(temp, remove_val)){
unsigned my_idx = idxs[i];
for (unsigned j = 1; (j-1) < (g.group_index().x+(stride*gridDim.x)); j++)
my_idx += idxs[j*nTPB-1];
odata[my_idx-1] = temp;}
stride++;}
}

int main(){
// data setup
mytype *d_idata, *d_odata, *h_data;
unsigned *d_idxs;
size_t tsize = ((size_t)test_dsize)*sizeof(mytype);
h_data = (mytype *)malloc(tsize);
cudaMalloc(&d_idata, tsize);
cudaMalloc(&d_odata, tsize);
cudaMemset(d_odata, 0, tsize);
cudaMalloc(&d_idxs, test_dsize*sizeof(unsigned));
// check for support and device configuration
// and calculate maximum grid size
cudaDeviceProp prop;
cudaError_t err = cudaGetDeviceProperties(&prop, 0);
if (err != cudaSuccess) {printf("cuda error: %s\n",
cudaGetErrorString(err)); return 0;}
if (prop.cooperativeLaunch == 0) {printf("cooperative launch not supported\n"); return 0;}
int numSM = prop.multiProcessorCount;
printf("number of SMs = %d\n", numSM);
int numBlkPerSM;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlkPerSM,
my_remove_if<mytype>, nTPB, 0);
printf("number of blocks per SM = %d\n", numBlkPerSM);
// test 1: no remove values
for (int i = 0; i < test_dsize; i++) h_data[i] = i;
cudaMemcpy(d_idata, h_data, tsize, cudaMemcpyHostToDevice);
cudaStream_t str;
cudaStreamCreate(&str);
mytype remove_val = -1;
unsigned ds = test_dsize;
void *args[] = {(void *)&d_idata, (void *)&remove_val, (void *)&d_odata,
(void *)&d_idxs, (void *)&ds};
dim3 grid(numBlkPerSM*numSM);
dim3 block(nTPB);
cudaLaunchCooperativeKernel((void *)my_remove_if<mytype>, grid,
block, args, 0, str);
err = cudaMemcpy(h_data, d_odata, tsize, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {printf("cuda error: %s\n",
cudaGetErrorString(err)); return 0;}
//validate
for (int i = 0; i < test_dsize; i++) if (h_data[i] != i){printf("mismatch 1 at %d, was: %d, should be: %d\n", i, h_data[i], i); return 1;}
// test 2: with remove values
int val = 0;
for (int i = 0; i < test_dsize; i++){
if ((rand()/(float)RAND_MAX) > 0.5) h_data[i] = val++;
else h_data[i] = -1;}
thrust::device_vector<mytype> t_data(h_data, h_data+test_dsize);
cudaMemcpy(d_idata, h_data, tsize, cudaMemcpyHostToDevice);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
cudaLaunchCooperativeKernel((void *)my_remove_if<mytype>, grid,
block, args, 0, str);
cudaEventRecord(stop);
float et;
cudaMemcpy(h_data, d_odata, tsize, cudaMemcpyDeviceToHost);
cudaEventElapsedTime(&et, start, stop);
//validate
for (int i = 0; i < val; i++) if (h_data[i] != i){printf("mismatch 2 at %d, was: %d, should be: %d\n", i, h_data[i], i); return 1;}
printf("kernel time: %fms\n", et);
cudaEventRecord(start);
thrust::remove(t_data.begin(), t_data.end(), -1);
cudaEventRecord(stop);
thrust::host_vector<mytype> th_data = t_data;
// validate
for (int i = 0; i < val; i++) if (h_data[i] != th_data[i]){printf("mismatch 3 at %d, was: %d, should be: %d\n", i, th_data[i], h_data[i]); return 1;}
cudaEventElapsedTime(&et, start, stop);
printf("thrust time: %fms\n", et);
return 0;
}

## **1. Streams Review**

For your first task, you are given a code that performs a silly computation
element-wise on a vector. We already implemented a chunked version of
this code using multiple CUDA streams in Homework 7. Let's start by
reviewing the performance impact that CUDA streams had on this code.
Compile it using the following:

```
module load cuda/11.4.0
nvcc -o streams streams.cu -DUSE_STREAMS
```

The module load command selects a CUDA compiler for your use. The
module load command only needs to be done once per session/login.
*nvcc* is the CUDA compiler invocation command. The syntax is generally
similar to gcc/g++.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./streams
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g1'
lsfrun ./streams
```

To build your code on NERSC's Cori-GPU

```
module load cgpu cuda/11.4.0
nvcc -o streams streams.cu -DUSE_STREAMS
```

To run during the node reservation (10:30-12:30 Pacific time on July 16):
```
module load cgpu cuda/11.4.0
srun -C gpu -N 1 -n 1 -t 10 -A ntrain --reservation=cuda_training -q shared
-G 1 -c 8 ./streams
```

or grab a GPU node first, then run interactively:


```
module load cgpu cuda
salloc -C gpu -N 1 -t 60 -A ntrain --reservation=cuda_training -q shared -G
1 -c 8
srun -n 1 ./streams
```

To run outside of the node reservation window:


Same steps as above, but do not include "--reservation=cuda_training -q
shared" in the srun or salloc commands.

In this case, the output will show the elapsed time of the non-overlapped
version of the code compared to the overlapped version of the code. The
non-overlapped version of the code copies the entire vector to the device,
then launches the processing kernel, then copies the entire vector back to
the host. In the overlapped version, the vector is broken up into chunks,
and then each chunk is copied and processed asynchronously on the GPU
using CUDA streams.
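
For reference, the streamed path in the provided streams.cu (listed later in this document) boils down to the loop below, shown here with the per-chunk offset factored into a named variable for readability; the surrounding allocation, timing, and error checking live in the full file, and the variable names (`ds`, `chunks`, `streams`, `gaussian_pdf`, etc.) are the ones it defines:

```cpp
// depth-first issue: H2D copy, kernel, D2H copy per chunk, round-robin over streams
for (int i = 0; i < chunks; i++) {
  size_t off = (size_t)i * (ds / chunks);
  cudaStream_t s = streams[i % num_streams];
  cudaMemcpyAsync(d_x + off, h_x + off, (ds / chunks) * sizeof(ft),
                  cudaMemcpyHostToDevice, s);
  gaussian_pdf<<<((ds / chunks) + 255) / 256, 256, 0, s>>>(
      d_x + off, d_y + off, 0.0, 1.0, ds / chunks);
  cudaMemcpyAsync(h_y + off, d_y + off, (ds / chunks) * sizeof(ft),
                  cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();
```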

You can also run this code with Nsight Systems if you wish to observe the
overlapping behavior:

On Summit:
```
module load nsight-systems
lsfrun nsys profile -o <destination_dir>/streams.qdrep ./streams
```
On Cori:
```
module load nsight-systems
srun -n 1 nsys profile -o <destination_dir>/streams.qdrep ./streams
```

Note that you will have to copy this file over to your local machine and
install Nsight Systems for visualization. You can download Nsight Systems
here:
https://developer.nvidia.com/nsight-systems

This visual output should show you the sequence of operations
(*cudaMemcpy* Host to Device, kernel call, and *cudaMemcpy* Device to
Host). When you run the code, a verification check is performed to make
sure you have processed the entire vector correctly, in chunks. If you pass
the verification test, the program will display the elapsed time of the
streamed version. The overlapped version of the code should be about 2X
faster than (i.e., about half the duration of) the non-streamed version. If
you profiled the code using Nsight Systems, you should be able to confirm
that there is indeed overlap of operations by zooming in on the portion of
execution related to kernel launches. You can see the non-overlapped version
run, followed by the overlapped version. Not only should the overlapped
version be faster, but you should also see an interleaving of computation
and data transfer operations.

## **2. OpenMP + CUDA Streams**

For this particular application, launching kernels asynchronously from a
single CPU thread is sufficient. However, for legacy HPC applications that
use OpenMP for on-node shared memory processing, that may not be the
case. Many of these applications utilize MPI for distributing work across
nodes, and they use OpenMP for better on-node shared memory
processing. However, each OpenMP thread may still have quite a bit of
work that can benefit from GPU acceleration, albeit not enough work to
saturate the GPU on its own. In cases like this, we can combine OpenMP
threads with CUDA streams to make sure our GPU is fully utilized.

In order to simulate this behavior, your task is to distribute the processing of
this code's vector chunks across OpenMP threads. If done correctly, each
thread will submit work to the GPU asynchronously using the CUDA
streams decomposition that is already present in the code. Note that this
will have no performance impact on this particular sample code. The
objective is to show that we can combine CPU thread parallelism with
CUDA streams in order to achieve concurrent execution on one or more
GPUs.
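
As a toy illustration of the combination (not the homework solution; the kernel, sizes, and names here are made up), several OpenMP threads can each submit asynchronous work into their own CUDA stream:

```cpp
#include <omp.h>
#include <cuda_runtime.h>

__global__ void touch(float *p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] += 1.0f;
}

int main() {
  const int nstreams = 4, n = 1 << 20;
  float *d[nstreams];
  cudaStream_t s[nstreams];
  for (int i = 0; i < nstreams; i++) {
    cudaMalloc(&d[i], n * sizeof(float));
    cudaStreamCreate(&s[i]);
  }
  // Each OpenMP thread issues launches into its own stream; launches from
  // different host threads are thread-safe and can overlap on the GPU.
  #pragma omp parallel for
  for (int i = 0; i < nstreams; i++)
    touch<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
  cudaDeviceSynchronize();
  for (int i = 0; i < nstreams; i++) {
    cudaFree(d[i]);
    cudaStreamDestroy(s[i]);
  }
  return 0;
}
```

Build with `nvcc -Xcompiler -fopenmp`, as in the instructions below.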

Once you have inserted your OpenMP statement(s), compile and run using
the following instructions.

On Summit:
```
nvcc -Xcompiler -fopenmp -o streams streams.cu -DUSE_STREAMS
export OMP_NUM_THREADS=8
jsrun -n1 -a1 -c8 -bpacked:8 -g1 ./streams
```

On Cori:
```
nvcc -Xcompiler -fopenmp -o streams streams.cu -DUSE_STREAMS
export OMP_NUM_THREADS=8
srun -C gpu -N 1 -n 1 -t 10 -A ntrain --reservation=cuda_training -q shared
-G 1 -c 8 ./streams
```

What does the performance look like compared to exercise 1? It should
look pretty similar. How about when you profile the code? Unfortunately, the
profiler currently requires some serialization when profiling across CPU
threads, so you should actually see slower performance compared to the
non-overlapped version. This should be reflected in the resulting qdrep file.
Notice that we don't observe nearly as much concurrent execution on the
GPU. This is something we are working on, and future versions of the
profiler may not suffer from this limitation.

If you need help, refer to *streams_solution.cu*.

## **3. Bonus Task - Multi-GPU**

Remember that a CUDA stream is tied to a particular GPU. How can we
combine CPU threading with more than a single GPU? If you're feeling
adventurous, try adapting this homework's code to submit work to 4 GPUs
instead of just one. Note that this will require keeping track of which CUDA
stream was bound to which GPU when it was created. Feel free to increase
the problem size in order to ensure that there is enough work to observe a
performance impact. One possible way to organize streams per GPU is
sketched below; compile and run your code using the instructions that
follow it.
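
A minimal sketch of the bookkeeping (not a complete solution; it assumes 4 visible devices, and the commented-out launch shows where the per-chunk work from the homework code would go):

```cpp
#include <cuda_runtime.h>

int main() {
  const int num_gpus = 4, streams_per_gpu = 2;
  cudaStream_t streams[num_gpus][streams_per_gpu];
  // A stream is created on (and bound to) whichever device is current at
  // creation time, so record the device index alongside each stream.
  for (int g = 0; g < num_gpus; g++) {
    cudaSetDevice(g);
    for (int s = 0; s < streams_per_gpu; s++)
      cudaStreamCreate(&streams[g][s]);
  }
  // Later, before submitting work for chunk i:
  //   int g = i % num_gpus;
  //   cudaSetDevice(g);  // must match the device the stream was created on
  //   kernel<<<grid, block, 0, streams[g][i % streams_per_gpu]>>>(...);
  for (int g = 0; g < num_gpus; g++) {
    cudaSetDevice(g);
    for (int s = 0; s < streams_per_gpu; s++)
      cudaStreamDestroy(streams[g][s]);
  }
  return 0;
}
```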

On Summit:
```
nvcc -Xcompiler -fopenmp -o streams streams.cu -DUSE_STREAMS
export OMP_NUM_THREADS=8
jsrun -n1 -a1 -c8 -bpacked:8 -g4 ./streams
```

On Cori:
```
nvcc -Xcompiler -fopenmp -o streams streams.cu -DUSE_STREAMS
export OMP_NUM_THREADS=8
srun -C gpu -N 1 -n 1 -t 10 -A ntrain --reservation=cuda_training -q shared
-G 4 -c 8 ./streams
```
#include <math.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

// modifiable
typedef float ft;
const int chunks = 64;
const size_t ds = 1024*1024*chunks;
const int count = 22;
const int num_streams = 8;

// not modifiable
const float sqrt_2PIf = 2.5066282747946493232942230134974f;
const double sqrt_2PI = 2.5066282747946493232942230134974;
__device__ float gpdf(float val, float sigma) {
return expf(-0.5f * val * val) / (sigma * sqrt_2PIf);
}

__device__ double gpdf(double val, double sigma) {


return exp(-0.5 * val * val) / (sigma * sqrt_2PI);
}

// compute average gaussian pdf value over a window around each point
__global__ void gaussian_pdf(const ft * __restrict__ x, ft * __restrict__ y,
const ft mean, const ft sigma, const int n) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < n) {
ft in = x[idx] - (count / 2) * 0.01f;
ft out = 0;
for (int i = 0; i < count; i++) {
ft temp = (in - mean) / sigma;
out += gpdf(temp, sigma);
in += 0.01f;
}
y[idx] = out / count;
}
}
// error check macro
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

// host-based timing
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start) {


timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main() {
ft *h_x, *d_x, *h_y, *h_y1, *d_y;
cudaHostAlloc(&h_x, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y1, ds*sizeof(ft), cudaHostAllocDefault);
cudaMalloc(&d_x, ds*sizeof(ft));
cudaMalloc(&d_y, ds*sizeof(ft));
cudaCheckErrors("allocation error");

cudaStream_t streams[num_streams];
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
}
cudaCheckErrors("stream creation error");

gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds); // warm-up

for (size_t i = 0; i < ds; i++) {


h_x[i] = rand() / (ft)RAND_MAX;
}
cudaDeviceSynchronize();

unsigned long long et1 = dtime_usec(0);

cudaMemcpy(d_x, h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);


gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds);
cudaMemcpy(h_y1, d_y, ds * sizeof(ft), cudaMemcpyDeviceToHost);
cudaCheckErrors("non-streams execution error");

et1 = dtime_usec(et1);
std::cout << "non-stream elapsed time: " << et1/(float)USECPSEC <<
std::endl;

#ifdef USE_STREAMS
cudaMemset(d_y, 0, ds * sizeof(ft));

unsigned long long et = dtime_usec(0);

for (int i = 0; i < chunks; i++) { //depth-first launch


cudaMemcpyAsync(d_x + i * (ds / chunks), h_x + i * (ds / chunks), (ds /
chunks) * sizeof(ft), cudaMemcpyHostToDevice, streams[i %
num_streams]);
gaussian_pdf<<<((ds / chunks) + 255) / 256, 256, 0, streams[i %
num_streams]>>>(d_x + i * (ds / chunks), d_y + i * (ds / chunks), 0.0, 1.0,
ds / chunks);
cudaMemcpyAsync(h_y + i * (ds / chunks), d_y + i * (ds / chunks), (ds /
chunks) * sizeof(ft), cudaMemcpyDeviceToHost, streams[i %
num_streams]);
}
cudaDeviceSynchronize();
cudaCheckErrors("streams execution error");

et = dtime_usec(et);

for (int i = 0; i < ds; i++) {


if (h_y[i] != h_y1[i]) {
std::cout << "mismatch at " << i << " was: " << h_y[i] << " should be: "
<< h_y1[i] << std::endl;
return -1;
}
}

std::cout << "streams elapsed time: " << et/(float)USECPSEC <<


std::endl;
#endif

return 0;
}
#include <math.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

// modifiable
typedef float ft;
const int chunks = 64;
const size_t ds = 1024*1024*chunks;
const int count = 22;
const int num_streams = 8;
// not modifiable
const float sqrt_2PIf = 2.5066282747946493232942230134974f;
const double sqrt_2PI = 2.5066282747946493232942230134974;
__device__ float gpdf(float val, float sigma) {
return expf(-0.5f * val * val) / (sigma * sqrt_2PIf);
}

__device__ double gpdf(double val, double sigma) {


return exp(-0.5 * val * val) / (sigma * sqrt_2PI);
}

// compute average gaussian pdf value over a window around each point
__global__ void gaussian_pdf(const ft * __restrict__ x, ft * __restrict__ y,
const ft mean, const ft sigma, const int n) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < n) {
ft in = x[idx] - (count / 2) * 0.01f;
ft out = 0;
for (int i = 0; i < count; i++) {
ft temp = (in - mean) / sigma;
out += gpdf(temp, sigma);
in += 0.01f;
}
y[idx] = out / count;
}
}

// error check macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

// host-based timing
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start) {


timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main() {
ft *h_x, *d_x, *h_y, *h_y1, *d_y;
cudaHostAlloc(&h_x, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y, ds*sizeof(ft), cudaHostAllocDefault);
cudaHostAlloc(&h_y1, ds*sizeof(ft), cudaHostAllocDefault);
cudaMalloc(&d_x, ds*sizeof(ft));
cudaMalloc(&d_y, ds*sizeof(ft));
cudaCheckErrors("allocation error");

cudaStream_t streams[num_streams];
for (int i = 0; i < num_streams; i++) {
cudaStreamCreate(&streams[i]);
}
cudaCheckErrors("stream creation error");

gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds); // warm-up

for (size_t i = 0; i < ds; i++) {


h_x[i] = rand() / (ft)RAND_MAX;
}
cudaDeviceSynchronize();

unsigned long long et1 = dtime_usec(0);

cudaMemcpy(d_x, h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);


gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds);
cudaMemcpy(h_y1, d_y, ds * sizeof(ft), cudaMemcpyDeviceToHost);
cudaCheckErrors("non-streams execution error");

et1 = dtime_usec(et1);
std::cout << "non-stream elapsed time: " << et1/(float)USECPSEC <<
std::endl;

#ifdef USE_STREAMS
cudaMemset(d_y, 0, ds * sizeof(ft));

unsigned long long et = dtime_usec(0);

#pragma omp parallel for


for (int i = 0; i < chunks; i++) { //depth-first launch
cudaMemcpyAsync(d_x + i * (ds / chunks), h_x + i * (ds / chunks), (ds /
chunks) * sizeof(ft), cudaMemcpyHostToDevice, streams[i %
num_streams]);
gaussian_pdf<<<((ds / chunks) + 255) / 256, 256, 0, streams[i %
num_streams]>>>(d_x + i * (ds / chunks), d_y + i * (ds / chunks), 0.0, 1.0,
ds / chunks);
cudaMemcpyAsync(h_y + i * (ds / chunks), d_y + i * (ds / chunks), (ds /
chunks) * sizeof(ft), cudaMemcpyDeviceToHost, streams[i %
num_streams]);
}
cudaDeviceSynchronize();
cudaCheckErrors("streams execution error");

et = dtime_usec(et);
for (int i = 0; i < ds; i++) {
if (h_y[i] != h_y1[i]) {
std::cout << "mismatch at " << i << " was: " << h_y[i] << " should be: "
<< h_y1[i] << std::endl;
return -1;
}
}

std::cout << "streams elapsed time: " << et/(float)USECPSEC <<


std::endl;
#endif

return 0;
}
# Multi-Process Service

On Cori GPU, first grab an interactive session. Make sure that you request
at least a few slots for MPI, but we'll only need one GPU.

```
module purge
module load cgpu gcc/8.3.0 cuda/11.4.0 openmpi/4.0.3
salloc -A ntrain -q shared --reservation=cuda_mps -C gpu -N 1 -n 4 -t 60 -c
4 --gpus=1
```

The test code used in the lecture is in `test.cu`, and it can be compiled with:

```
nvcc -o test -ccbin=mpicxx test.cu
```

If you're running somewhere where you don't have MPI, you can compile
the application without MPI as follows:
```
nvcc -DNO_MPI -o test test.cu
```

Then in all of the examples below, instead of launching with `mpirun`, use
the provided `run_no_mpi.sh` script, which launches 4 redundant copies of
the same process. This script might also be useful for systems like Summit
where you launch jobs from a different node than the compute node, where
`nsys jsrun ...` is less useful than `jsrun ... nsys`.

## Verifying the lecture findings

Your exercise is to try some of the experiments from the lecture and see if
you can reproduce the findings. Try the following experiments first, without
MPS (note that this application does take about 20 seconds to run, so be
patient):

```
nsys profile --stats=true -t nvtx,cuda -s none -o 1_rank_no_MPS_N_1e9 -f
true mpirun -np 1 ./test 1073741824
nsys profile --stats=true -t nvtx,cuda -s none -o 4_ranks_no_MPS_N_1e9 -f
true mpirun -np 4 ./test 1073741824
```

Verify from both the application stdout and from the profiling data that the
average kernel runtime is longer when using 4 ranks on the same GPU.

Now start MPS and repeat the above experiment with 4 ranks, verifying
that the average kernel runtime is about the same as in the 1 rank case
(again, consult both the stdout and the profiling data).

```
nvidia-cuda-mps-control -d
nsys profile --stats=true -t nvtx,cuda -s none -o 4_ranks_with_MPS_N_1e9
-f true mpirun -np 4 ./test 1073741824
```

Now verify that you can stop MPS and the original behavior returns.

```
echo "quit" | nvidia-cuda-mps-control
nsys profile --stats=true -t nvtx,cuda -s none -o 4_ranks_no_MPS_N_1e9 -f
true mpirun -np 4 ./test 1073741824
```

## Experimenting with problem size

Vary the problem size `N` until you've found the minimum size where you
can definitively say that MPS provides a clear benefit over the default
compute mode case.
#!/bin/bash

PROBLEM_SIZE=1073741824
NUM_RANKS=4

./test $PROBLEM_SIZE $NUM_RANKS &


./test $PROBLEM_SIZE $NUM_RANKS &
./test $PROBLEM_SIZE $NUM_RANKS &
./test $PROBLEM_SIZE $NUM_RANKS &
wait
#ifndef NO_MPI
#include <mpi.h>
#endif
#include <cstdio>
#include <chrono>
#include <iostream>

__global__ void kernel (double* x, int N) {


int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
x[i] = 2 * x[i];
}
}

int main(int argc, char** argv) {


#ifndef NO_MPI
int rank, num_ranks;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#endif

// Total problem size


size_t N = 1024 * 1024 * 1024;

if (argc >= 2) {
N = atoi(argv[1]);
}

#ifdef NO_MPI
// If not using MPI, specify at command line how many "ranks" there are
int num_ranks = 1;
if (argc >= 3) {
num_ranks = atoi(argv[2]);
}
#endif

// Problem size per rank (assumes divisibility of N)


size_t N_per_rank = N / num_ranks;

double* x;
cudaMalloc((void**) &x, N_per_rank * sizeof(double));
// Number of repetitions

const int num_reps = 1000;

using namespace std::chrono;

auto start = high_resolution_clock::now();

int threads_per_block = 256;


size_t blocks = (N_per_rank + threads_per_block - 1) /
threads_per_block;

for (int i = 0; i < num_reps; ++i) {


kernel<<<blocks, threads_per_block>>>(x, N_per_rank);
cudaDeviceSynchronize();
}

auto end = high_resolution_clock::now();

auto duration = duration_cast<milliseconds>(end - start);

std::cout << "Time per kernel = " << duration.count() / (double) num_reps
<< " ms " << std::endl;

#ifndef NO_MPI
MPI_Finalize();
#endif
}
# **Task 1**

In this task we will explore using compute-sanitizer. A complete tiled
matrix-multiply example code is provided in the CUDA programming guide.
The *task1.cu* code includes this code with a few changes, and also a
main() routine to drive the operation. You are providing support services to
a cluster user community, and one of your users has presented this code
with the report that "CUDA error checking doesn't show any errors, but I'm
not getting the right answer. Please help!"

First, compile the code as follows, and run the code to observe the reported
behavior:

```
module load cuda
nvcc -arch=sm_70 task1.cu -o task1 -lineinfo
```

We are compiling the code for the GPU architecture being used (Volta SM
7.0 in this case) and we are also compiling with the -lineinfo switch. You
know as a CUDA support engineer that this will be a useful switch when it
comes to using compute-sanitizer.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1 ./task1
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g1'
lsfrun ./task1
```

To build your code on NERSC's Cori-GPU

```
module load cgpu cuda/11.4.0
nvcc -arch=sm_70 task1.cu -o task1 -lineinfo
```

To run during the node reservation (10:30-12:30 Pacific time on September 14):
```
module load cgpu cuda/11.4.0
srun -C gpu -N 1 -n 1 -t 10 -A ntrain --reservation=cuda_debug -q shared
-G 1 -c 1 ./task1
```

or grab a GPU node first, then run interactively:


```
module load cgpu cuda
salloc -C gpu -N 1 -t 60 -A ntrain --reservation=cuda_debug -q shared -G 1
-c 1
srun -n 1 ./task1
```

To run outside of the node reservation window:


Same steps as above, but do not include "*--reservation=cuda_debug -q
shared*" in the srun or salloc commands.

If this code produces the correct matrix result, it will display:

```
Success!
```

But unfortunately we don't see that.

## Part A
Use basic *compute-sanitizer* functionality (no additional switches) to
identify a problem in the code. Using the output from *compute-sanitizer*,
identify the offending line of code. Fix this issue.

Hints:
- Remember that *-lineinfo* will cause compute-sanitizer (in this usage) to
report the actual line of code that is causing the problem.
- Even if you didn't have this information (the line number), could you use
other compute-sanitizer information to quickly deduce the line to focus on in
this case? You could use the type of memory access violation as a clue.
Which line of code in the kernel is doing that type of memory access?
(Hint: there is only one line of kernel code that does this.)
- Memory access problems are often caused by indexing errors. See if
you can spot an indexing error that may lead to this issue (hint: the classic
computer science "off by one" error; a generic illustration of this class of
bug follows this list).
- Refer to *task1_solution.cu* if you get stuck
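
For illustration only (this is not the task1.cu kernel, and not necessarily its exact bug), an off-by-one loop bound reads one element past the end of an array; compute-sanitizer typically reports this as an invalid global read and, with -lineinfo, attributes it to the offending source line:

```cpp
__global__ void off_by_one(const float *in, float *out, int n) {
  float sum = 0.0f;
  // BUG: "<= n" touches in[n], one element past the end of an n-element array
  for (int i = 0; i <= n; ++i)
    sum += in[i];
  out[0] = sum;
}
```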

## Part B

Yay! You sorted out the problem, made the change to indexing, and now
the code prints "Success!" It's time to send the user on their way. Or is it?
Could there be other errors? Use additional compute-sanitizer switches
(*--tool racecheck*, *--tool initcheck*, *--tool synccheck*) to identify other
"latent" issues. Fix them.

Hints:
- The only tool that should report a problem at this point is the racecheck
tool.
- See if you can use the line number information embedded in the error
reports to identify the trouble "zone" in the kernel code
- Since you know that the racecheck tool reports race issues with shared
memory usage (only), and that these often involve missing synchronization,
can you identify the right place to insert appropriate synchronization into
the kernel code? Try experimenting. Inserting additional synchronization
into a CUDA kernel usually does not break code correctness.
- Refer to *task1_solution.cu* if you get stuck

# **Task 2**

In this task we will explore basic usage of cuda-gdb. Once again you are
providing user support at a cluster help desk. The user has a code that
produces a *-inf* (negative floating-point infinity) result, and that is not
expected. The code consists of a transformation operation (one data
element created/modified per thread) followed by a reduction operation
(per-thread results summed together). The output of the reduction is *-inf*.
See if you can use *cuda-gdb* to identify the problem and rectify it.

To prepare to use *cuda-gdb*, it's necessary to build the code with debug
information. Therefore, compile the code as follows:

```
nvcc -arch=sm_70 task2.cu -o task2 -G -g -std=c++14
```

You can then start debugging.

On Summit:

```
jsrun -n1 -a1 -c1 -g1 cuda-gdb ./task2
```

On Cori:

```
srun -n 1 cuda-gdb ./task2
```

Don't forget that you cannot inspect device data until you are stopped at a
device-code breakpoint.

Once you have identified the source of the issue, see if you can propose a
simple code modification to work around it. If you get stuck on this part
(proposing a solution), refer to *task2_solution.cu*. Careful code inspection
will likely point out the issue immediately; however, the purpose of this
task is not actually to fix the code that way, but to learn to use
*cuda-gdb*.

Hints:
- The code is attempting to estimate the sum of an alternating harmonic
series (ahs), whose sum should be equal to the natural log of 2.
- The code is broken into two parts: the ahs term generator (produced by
the device function ahs) which takes only the index of the term to generate,
and a standard sweep parallel reduction, similar to the content in session 5
of this training series.
- Generally speaking, floating point arithmetic on *inf* or *-inf* inputs will
produce an *inf* or *-inf* output
- Decide whether you think the *-inf* is likely to appear as a result of the
initial transformation operation, or the subsequent reduction operation
- Use this reasoning to choose a point for an initial breakpoint
- Inspect data to see if you can observe *-inf* in any of the intermediate
data
- Use this observation to repeat the process of setting a breakpoint and
inspecting data
- Alternatively, work linearly through the code, setting an initial breakpoint
and single-stepping, to see if you can observe incorrect data
- You may need to change thread focus or observe data belonging to other
threads
- The reduction also offers the opportunity to tackle this problem via
divide-and-conquer, or binary searching
- Consider reducing the problem size (i.e., the number of terms used to
generate the estimate) to simplify your debug effort
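
A typical session might look like the following sketch (the line number and variable name are placeholders for whatever you decide to inspect; use the srun equivalent shown above on Cori):

```
jsrun -n1 -a1 -c1 -g1 cuda-gdb ./task2
(cuda-gdb) break task2.cu:123        # set a breakpoint at a device-code line of interest
(cuda-gdb) run
(cuda-gdb) info cuda threads         # see where the device threads are stopped
(cuda-gdb) cuda block 0 thread 5     # switch focus to a particular thread
(cuda-gdb) print some_variable       # inspect data in the focused thread
(cuda-gdb) next                      # single-step
(cuda-gdb) continue
```
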
#include <iostream>
// Thread block size
#define BLOCK_SIZE 32
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
int width;
int height;
int stride;
float* elements;
} Matrix;

// Get a matrix element


__device__ float GetElement(const Matrix A, int row, int col)
{
return A.elements[row * A.stride + col];
}

// Set a matrix element


__device__ void SetElement(Matrix A, int row, int col,
float value)
{
A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is


// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
Matrix Asub;
Asub.width = BLOCK_SIZE;
Asub.height = BLOCK_SIZE;
Asub.stride = A.stride;
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
+ BLOCK_SIZE * col];
return Asub;
}

// Forward declaration of the matrix multiplication kernel


__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code


// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
// Load A and B to device memory
Matrix d_A;
d_A.width = d_A.stride = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc(&d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size,
cudaMemcpyHostToDevice);
Matrix d_B;
d_B.width = d_B.stride = B.width; d_B.height = B.height;
size = B.width * B.height * sizeof(float);
cudaMalloc(&d_B.elements, size);
cudaMemcpy(d_B.elements, B.elements, size,
cudaMemcpyHostToDevice);

// Allocate C in device memory


Matrix d_C;
d_C.width = d_C.stride = C.width; d_C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc(&d_C.elements, size);

// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
// Read C from device memory
cudaMemcpy(C.elements, d_C.elements, size,
cudaMemcpyDeviceToHost);

// Free device memory


cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()


__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;

// Each thread block computes one sub-matrix Csub of C


Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

// Each thread computes one element of Csub


// by accumulating results into Cvalue
float Cvalue = 0;

// Thread row and column within Csub


int row = threadIdx.y;
int col = threadIdx.x;

// Loop over all the sub-matrices of A and B that are


// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

// Get sub-matrix Asub of A


Matrix Asub = GetSubMatrix(A, blockRow, m);

// Get sub-matrix Bsub of B


Matrix Bsub = GetSubMatrix(B, m, blockCol);

// Shared memory used to store Asub and Bsub respectively


__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load Asub and Bsub from device memory to shared memory


// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);

// Synchronize to make sure the sub-matrices are loaded


// before starting the computation
__syncthreads();
// Multiply Asub and Bsub together
for (int e = 0; e <= BLOCK_SIZE; ++e)
Cvalue += As[row][e] * Bs[e][col];
} // end of loop over the sub-matrices of A and B

// Write Csub to device memory
// Each thread writes one element
SetElement(Csub, row, col, Cvalue);
}

int main(){
const int num_m = 3; // we need 3 matrices
const int side_dim = 128; // side dimension of square matrix
Matrix *m = new Matrix[num_m]; // allocate matrix storage part 1
for (int i = 0; i < num_m; i++){
m[i].width = m[i].height = m[i].stride = side_dim; // set matrix params
m[i].elements = new float[side_dim*side_dim]; // allocate matrix storage part 2
if (i < 2) // initialize first two matrices
for (int j = 0; j < side_dim*side_dim; j++) m[i].elements[j] = 1.0f; }
MatMul(m[0], m[1], m[2]); // perform matrix-multiply
std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
for (int i = 0; i < side_dim*side_dim; i++) // perform results checking
if (m[2].elements[i] != (float)side_dim) {std::cout << "Mismatch at index: " << i << " expected: " << (float)side_dim << " got: " << m[2].elements[i] << std::endl; return 0;}
std::cout << "Success!" << std::endl;
for (int i = 0; i < num_m; i++)
delete[] m[i].elements;
delete[] m;
return 0;
}
#include <iostream>
// Thread block size
#define BLOCK_SIZE 32

// Matrices are stored in row-major order:


// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
int width;
int height;
int stride;
float* elements;
} Matrix;

// Get a matrix element


__device__ float GetElement(const Matrix A, int row, int col)
{
return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
float value)
{
A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is


// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
Matrix Asub;
Asub.width = BLOCK_SIZE;
Asub.height = BLOCK_SIZE;
Asub.stride = A.stride;
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
+ BLOCK_SIZE * col];
return Asub;
}

// Forward declaration of the matrix multiplication kernel


__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code


// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
// Load A and B to device memory
Matrix d_A;
d_A.width = d_A.stride = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc(&d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size,
cudaMemcpyHostToDevice);
Matrix d_B;
d_B.width = d_B.stride = B.width; d_B.height = B.height;
size = B.width * B.height * sizeof(float);
cudaMalloc(&d_B.elements, size);
cudaMemcpy(d_B.elements, B.elements, size,
cudaMemcpyHostToDevice);

// Allocate C in device memory


Matrix d_C;
d_C.width = d_C.stride = C.width; d_C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc(&d_C.elements, size);

// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

// Read C from device memory


cudaMemcpy(C.elements, d_C.elements, size,
cudaMemcpyDeviceToHost);

// Free device memory


cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()


__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

// Each thread computes one element of Csub


// by accumulating results into Cvalue
float Cvalue = 0;

// Thread row and column within Csub


int row = threadIdx.y;
int col = threadIdx.x;

// Loop over all the sub-matrices of A and B that are


// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

// Get sub-matrix Asub of A


Matrix Asub = GetSubMatrix(A, blockRow, m);

// Get sub-matrix Bsub of B


Matrix Bsub = GetSubMatrix(B, m, blockCol);

// Shared memory used to store Asub and Bsub respectively


__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load Asub and Bsub from device memory to shared memory


// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);

// Synchronize to make sure the sub-matrices are loaded


// before starting the computation
__syncthreads();
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += As[row][e] * Bs[e][col];

// Synchronize to make sure that the preceding


// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}

// Write Csub to device memory


// Each thread writes one element
SetElement(Csub, row, col, Cvalue);
}

int main(){
const int num_m = 3; // we need 3 matrices
const int side_dim = 128; // side dimension of square matrix
Matrix *m = new Matrix[num_m]; // allocate matrix storage part 1
for (int i = 0; i < num_m; i++){
m[i].width = m[i].height = m[i].stride = side_dim; // set matrix params
m[i].elements = new float[side_dim*side_dim]; // allocate matrix storage part 2
if (i < 2) // initialize first two matrices
for (int j = 0; j < side_dim*side_dim; j++) m[i].elements[j] = 1.0f; }
MatMul(m[0], m[1], m[2]); // perform matrix-multiply
std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
for (int i = 0; i < side_dim*side_dim; i++) // perform results checking
if (m[2].elements[i] != (float)side_dim) {std::cout << "Mismatch at index: " << i << " expected: " << (float)side_dim << " got: " << m[2].elements[i] << std::endl; return 0;}
std::cout << "Success!" << std::endl;
for (int i = 0; i < num_m; i++)
delete[] m[i].elements;
delete[] m;
return 0;
}
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <cstdio>

// alternating harmonic series:
// https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Alternating_harmonic_series
// compute alternating harmonic series member based on index n
__device__ auto ahs(size_t n){ return ((n&1)?1:-1)/(double)n;}

// blocksize must be a power of 2, less than or equal to 1024


#define BLOCK_SIZE 512

// estimate summation of alternating harmonic series


template <typename T>
__global__ void estimate_sum_ahs(size_t length, T *sum){
__shared__ T smem[BLOCK_SIZE];
size_t idx = blockDim.x*blockIdx.x+threadIdx.x;
smem[threadIdx.x] = (idx < length)?ahs(idx):0;

for (int i = blockDim.x>>1; i > 0; i >>= 1){


__syncthreads();
if (threadIdx.x < i) smem[threadIdx.x] += smem[threadIdx.x+i];}

if (threadIdx.x == 0) atomicAdd(sum, smem[0]);


}

typedef double ft;

int main(int argc, char* argv[]){


size_t my_length = 1048576; // allow user to override default estimation length with command-line argument
if (argc > 1) my_length = atol(argv[1]);
ft *sum;
cudaError_t err = cudaMallocManaged(&sum, sizeof(ft));
if (err != cudaSuccess) {std::cout << "Error: " << cudaGetErrorString(err)
<< std::endl; return 0;}
*sum = 0;
dim3 block(BLOCK_SIZE);
dim3 grid((my_length+block.x-1)/block.x);
estimate_sum_ahs<<<grid, block>>>(my_length, sum);
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {std::cout << "Error: " << cudaGetErrorString(err)
<< std::endl; return 0;}
std::cout << "Estimated value: " << *sum << " Expected value: " << log(2)
<< std::endl;
return 0;
}
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <cstdio>

// alternating harmonic series:
// https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Alternating_harmonic_series
// compute alternating harmonic series member based on index n
__device__ auto ahs(size_t n){ return ((n&1)?1:-1)/(double)n;}

// blocksize must be a power of 2, less than or equal to 1024


#define BLOCK_SIZE 512

// estimate summation of alternating harmonic series


template <typename T>
__global__ void estimate_sum_ahs(size_t length, T *sum){
__shared__ T smem[BLOCK_SIZE];
size_t idx = blockDim.x*blockIdx.x+threadIdx.x;
smem[threadIdx.x] = (idx < length)?ahs(idx):0;
if (idx == 0) smem[0] = 0;

for (int i = blockDim.x>>1; i > 0; i >>= 1){


__syncthreads();
if (threadIdx.x < i) smem[threadIdx.x] += smem[threadIdx.x+i];}

if (threadIdx.x == 0) atomicAdd(sum, smem[0]);


}

typedef double ft;

int main(int argc, char* argv[]){


size_t my_length = 1048576; // allow user to override default estimation length with command-line argument
if (argc > 1) my_length = atol(argv[1]);
ft *sum;
cudaError_t err = cudaMallocManaged(&sum, sizeof(ft));
if (err != cudaSuccess) {std::cout << "Error: " << cudaGetErrorString(err)
<< std::endl; return 0;}
*sum = 0;
dim3 block(BLOCK_SIZE);
dim3 grid((my_length+block.x-1)/block.x);
estimate_sum_ahs<<<grid, block>>>(my_length, sum);
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {std::cout << "Error: " << cudaGetErrorString(err)
<< std::endl; return 0;}
std::cout << "Estimated value: " << *sum << " Expected value: " << log(2)
<< std::endl;
return 0;
}
# CUDA Graphs
In this homework we will look at two different codes, both using CUDA
Graphs. These codes consist of small kernels and could see some benefit
from CUDA Graphs given their particular workflow. The two codes are
axpy_stream_capture and axpy_cublas, each having a with_fixme and a
from_scratch version. We recommend starting with the with_fixme versions
of the two codes, and then trying the from_scratch versions if you want a
challenge. The with_fixme versions have spots that you will need to fix to get
the code to run, but the framework is already in place. The from_scratch
versions require you to implement the graph setup and logic by hand.

You can refer to the solutions in the Solutions directory for help/hints when
stuck.

### Task 1
#### Stream Capture
This task is an example of how to use stream capture with CUDA Graphs.
We will create a graph from a sequence of kernel launches across two
streams.

We will implement the following graph, which is helpful to see visually:

![](graph_stream_capture.png)

This is the same example shown in the slides; feel free to refer to them for
help and hints.

Go ahead and take a look at the code now to get a sense of the new graph
API calls. On a first pass, ignore the graph APIs and try to get a feel for the
underlying code and what it is doing. The kernels themselves are not doing
any specific math; they simply stand in for small kernels of your own.
Think about the role of the two streams and refer back to the picture above
to make sure you see the inherent dependencies created by the CUDA events.

`bool graphCreated=false;` is our mechanism for setting up the graph on the
first pass only (for-loop iteration 0) and then launching the graph directly in
each subsequent iteration (1 through N-1). A sketch of this pattern appears
below.
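
As a rough sketch of that pattern (not the homework code itself): `my_kernel` is a
placeholder name, the streams, events, and launch configuration are assumed to
already exist as in the exercise, and error checking is omitted:

```
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;

for (int i = 0; i < 100; ++i){
  if (!graphCreated){
    // Iteration 0 only: record the work into a graph instead of running it directly
    cudaStreamBeginCapture(streams[0], cudaStreamCaptureModeGlobal);

    my_kernel<<<blocks, threads, 0, streams[0]>>>(d_x, d_y);
    cudaEventRecord(event1, streams[0]);       // fork: stream 1 waits on stream 0
    cudaStreamWaitEvent(streams[1], event1, 0);
    my_kernel<<<blocks, threads, 0, streams[1]>>>(d_x, d_y);
    cudaEventRecord(event2, streams[1]);       // join: stream 0 waits on stream 1
    cudaStreamWaitEvent(streams[0], event2, 0);
    my_kernel<<<blocks, threads, 0, streams[0]>>>(d_x, d_y);

    cudaStreamEndCapture(streams[0], &graph);  // graph now describes the whole DAG
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    graphCreated = true;
  }
  // Every iteration (including iteration 0) replays the same DAG
  cudaGraphLaunch(instance, streams[0]);
  cudaStreamSynchronize(streams[0]);
}
```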

An important distinction is the difference between the types `cudaGraph_t`
and `cudaGraphExec_t`. `cudaGraph_t` defines the shape and the arguments
of the overall graph, while `cudaGraphExec_t` is a launchable instance of the
graph that has gone through the instantiation step.
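
In code form, the relationship is roughly the following (a minimal sketch,
assuming `stream` already holds captured work):

```
cudaGraph_t graph;        // the description: topology plus node arguments
cudaGraphExec_t instance; // the executable, validated form of that description

cudaStreamEndCapture(stream, &graph);                  // produces a cudaGraph_t
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0); // produces a cudaGraphExec_t
cudaGraphLaunch(instance, stream);                     // only the exec form can be launched
```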

First, to compile the code on Summit:

```
module load cuda/11.4.0
nvcc -arch=sm_70 axpy_stream_capture_with_fixme.cu -o
axpy_stream_capture_with_fixme
```

We are compiling the code for the GPU architecture being used (Volta, SM
7.0, in this case). CUDA Graphs has been included in every CUDA Toolkit
since CUDA 10, but some features may be version-dependent.

To run your code, we will use an LSF command:

```
bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1 -g1
./axpy_stream_capture_with_fixme
```

Alternatively, you may want to create an alias for your bsub command in
order to make subsequent runs easier:

```
alias lsfrun='bsub -W 10 -nnodes 1 -P <allocation_ID> -Is jsrun -n1 -a1 -c1
-g1'
lsfrun ./axpy_stream_capture_with_fixme
```

To build your code on NERSC's Cori-GPU

```
module load cgpu cuda/11.4.0
nvcc -arch=sm_70 axpy_stream_capture_with_fixme.cu -o
axpy_stream_capture_with_fixme
```

To run during the node reservation (10:30-12:30 Pacific time on October 13):
```
module load cgpu cuda/11.4.0
srun -C gpu -N 1 -n 1 -t 10 -A ntrain2 --reservation=cuda_graphs -q shared
-G 1 -c 1 ./axpy_stream_capture_with_fixme
```

or grab a GPU node first, then run interactively:


```
module load cgpu cuda
salloc -C gpu -N 1 -t 60 -A ntrain2 --reservation=cuda_graphs -q shared -G
1 -c 1
srun -n 1 ./axpy_stream_capture_with_fixme
```

To run outside of the node reservation window:

Same steps as above, but do not include "*--reservation=cuda_graphs -q shared*"
in the srun or salloc commands.

FIXMEs:
1. cudaGraphCreate(FIXME, 0);
2. cudaGraphInstantiate(FIXME, graph, NULL, NULL, 0);
3. graphCreated = FIXME;
4. cudaGraphLaunch(FIXME, streams[0]);

After you have completed the FIXMEs, you will see a time printed out when
you run the code. This is the total time for launching the graph 1000 times.
You can compare it to the time from the file axpy_stream_capture_timer.cu,
which is the same code running the CUDA work in streams instead of a graph.
These examples are primarily meant to introduce the topic and the API, so
they are not particularly performant. Even so, you should still see a small
performance increase from using the graph, thanks to the launch-overhead
savings. The instantiation phase is not included in the timing, however, so it
is not exactly an apples-to-apples comparison; it merely highlights the ideas
we saw in the slides.

### Task 2
#### Explicit Graph Creation w/ Library Call
In this task, we will look at a few of the explicit graph creation APIs and at
how to capture a library call with stream capture. A key to this example is
remembering that, while we use both explicit graph creation and stream
capture, both are just ways of defining a `cudaGraph_t`, which we then
instantiate into a `cudaGraphExec_t`.

We are creating 2 kernel nodes and a child graph derived from a cuBLAS
axpy function call. See the diagram below for a visual.

![](graph_with_library_call.png)

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html

This is the documentation of the current CUDA Toolkit graph management
API. You can complete this example without consulting the docs by using
the slides and the context clues in the code, but taking a look at the
definition of `cudaGraphAddChildGraphNode` may help if you get stuck on
the FIXMEs.

Unlike the first example, it is harder to "get the picture" of this code by
ignoring the graph API. In fact, without the graph API calls, there is no
runnable program! This level of code change adds a lot of control, at the
price of larger code changes and a loss of readability for users unfamiliar
with CUDA Graphs.

The API is a bit tricky because at first it is quite different from anything else
in CUDA, but the patterns are actually quite familiar. It is just a different way
to define CUDA work. A sketch of the pattern used here follows.
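
As a rough sketch under the same assumptions as the exercise code below
(the kernel, stream, cuBLAS handle, kernel arguments, and launch parameters
are taken as already set up; error checking is omitted), the explicit-creation-
plus-capture pattern looks like:

```
cudaGraph_t graph, libraryGraph;
cudaGraphNode_t kernelNode, libraryNode;
std::vector<cudaGraphNode_t> deps;

cudaGraphCreate(&graph, 0);                    // start with an empty parent graph

// Explicitly add a kernel node as the head of the graph
cudaKernelNodeParams p = {0};
p.func = (void *)kernel_a;
p.gridDim = dim3(blocks, 1, 1);
p.blockDim = dim3(threads, 1, 1);
p.kernelParams = (void **)kernelArgs;
cudaGraphAddKernelNode(&kernelNode, graph, NULL, 0, &p);
deps.push_back(kernelNode);

// Capture the cuBLAS call into its own graph, then attach it as a child
// graph node that depends on kernelNode
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
cublasSaxpy(cublas_handle, N, &d_a, d_x, 1, d_y, 1);
cudaStreamEndCapture(stream1, &libraryGraph);
cudaGraphAddChildGraphNode(&libraryNode, graph,
                           deps.data(), deps.size(), libraryGraph);

// ...add the second kernel node with libraryNode as its dependency, then:
cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, stream1);
```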

We will follow the same instructions as before to compile, this time adding
-lcublas to link the library.

```
nvcc -arch=sm_70 -lcublas axpy_cublas_with_fixme.cu -o
axpy_cublas_with_fixme
```

Using the alias we created for Summit, we can run as follows:

```
lsfrun ./axpy_cublas_with_fixme
```

Or for Cori GPU, on an interactive node:

```
srun -n 1 ./axpy_cublas_with_fixme
```

Take a look at axpy_cublas_with_fixme.cu and try to fill in the FIXMEs so the
code compiles and runs. Please consult the diagram above for the flow of the
program.

FIXMEs:
1. cudaGraphCreate(FIXME, 0);
2. cudaGraphAddChildGraphNode(FIXME, graph, FIXME,
nodeDependencies.size(), libraryGraph);
3. cudaGraphLaunch(FIXME, stream1);

#include <stdio.h>
#include <vector>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

// error checking macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

#define N 500000

// Simple short kernels


__global__
void kernel_a(float* x, float* y){
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) y[idx] += 1;
}

__global__
void kernel_c(float* x, float* y){
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) y[idx] += 1;
}

int main(){

cudaStream_t stream1;

cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);

cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);
cublasSetStream(cublas_handle, stream1);

// Set up host data and initialize


float* h_x;
float* h_y;

h_x = (float*) malloc(N * sizeof(float));


h_y = (float*) malloc(N * sizeof(float));

for (int i = 0; i < N; ++i){


h_x[i] = float(i);
h_y[i] = float(i);
}

// Print out the first 25 values of h_y


for (int i = 0; i < 25; ++i){
printf("%2.0f ", h_y[i]);
}
printf("\n");

// Set up device data


float* d_x;
float* d_y;
float d_a = 5.0;
cudaMalloc((void**) &d_x, N * sizeof(float));
cudaMalloc((void**) &d_y, N * sizeof(float));
cudaCheckErrors("cudaMalloc failed");

cublasSetVector(N, sizeof(h_x[0]), h_x, 1, d_x, 1); // similar to cudaMemcpy HtoD
cublasSetVector(N, sizeof(h_y[0]), h_y, 1, d_y, 1); // similar to cudaMemcpy HtoD
cudaCheckErrors("cublasSetVector failed");

// Set up graph
cudaGraph_t graph; // main graph
cudaGraph_t libraryGraph; // sub graph for cuBLAS call
std::vector<cudaGraphNode_t> nodeDependencies;
cudaGraphNode_t kernelNode1, kernelNode2, libraryNode;

cudaKernelNodeParams kernelNode1Params {0};


cudaKernelNodeParams kernelNode2Params {0};

cudaGraphCreate(&graph, 0); // create the graph


cudaCheckErrors("cudaGraphCreate failure");

// kernel_a and kernel_c use same args


void *kernelArgs[2] = {(void *)&d_x, (void *)&d_y};

int threads = 512;


int blocks = (N + threads - 1) / threads;

// Adding 1st node, kernel_a, as head node of graph


kernelNode1Params.func = (void *)kernel_a;
kernelNode1Params.gridDim = dim3(blocks, 1, 1);
kernelNode1Params.blockDim = dim3(threads, 1, 1);
kernelNode1Params.sharedMemBytes = 0;
kernelNode1Params.kernelParams = (void **)kernelArgs;
kernelNode1Params.extra = NULL;

cudaGraphAddKernelNode(&kernelNode1, graph, NULL,


0, &kernelNode1Params);
cudaCheckErrors("Adding kernelNode1 failed");

nodeDependencies.push_back(kernelNode1); // manage dependecy vector

// Adding 2nd node, libraryNode, with kernelNode1 as dependency


cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
cudaCheckErrors("Stream capture begin failure");

// Library call
cublasSaxpy(cublas_handle, N, &d_a, d_x, 1, d_y, 1);
cudaCheckErrors("cublasSaxpy failure");

cudaStreamEndCapture(stream1, &libraryGraph);
cudaCheckErrors("Stream capture end failure");

cudaGraphAddChildGraphNode(&libraryNode, graph,
nodeDependencies.data(),
nodeDependencies.size(), libraryGraph);
cudaCheckErrors("Adding libraryNode failed");

nodeDependencies.clear();
nodeDependencies.push_back(libraryNode); // manage dependency vector

// Adding 3rd node, kernel_c, with libraryNode as dependency


kernelNode2Params.func = (void *)kernel_c;
kernelNode2Params.gridDim = dim3(blocks, 1, 1);
kernelNode2Params.blockDim = dim3(threads, 1, 1);
kernelNode2Params.sharedMemBytes = 0;
kernelNode2Params.kernelParams = (void **)kernelArgs;
kernelNode2Params.extra = NULL;
cudaGraphAddKernelNode(&kernelNode2, graph,
nodeDependencies.data(),
nodeDependencies.size(), &kernelNode2Params);
cudaCheckErrors("Adding kernelNode2 failed");

nodeDependencies.clear();
nodeDependencies.push_back(kernelNode2); // manage dependency
vector

cudaGraphNode_t *nodes = NULL;


size_t numNodes = 0;
cudaGraphGetNodes(graph, nodes, &numNodes);
cudaCheckErrors("Graph get nodes failed");
printf("Number of the nodes in the graph = %zu\n", numNodes);

cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaCheckErrors("Graph instantiation failed");

// Launch the graph instance 100 times


for (int i = 0; i < 100; ++i){
cudaGraphLaunch(instance, stream1);
cudaStreamSynchronize(stream1);
}
cudaCheckErrors("Graph launch failed");
cudaDeviceSynchronize();

// Copy memory back to host


cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("Finishing memcpy failed");

cudaDeviceSynchronize();

// Print out the first 25 values of h_y


for (int i = 0; i < 25; ++i){
printf("%2.0f ", h_y[i]);
}
printf("\n");

return 0;

}
#include <stdio.h>
#include <cuda_runtime_api.h>
#include <ctime>
#include <ratio>
#include <chrono>
#include <iostream>

// error checking macro


#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
}\
} while (0)

#define N 500000

// Simple short kernels


__global__
void kernel_a(float * x, float * y){
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx < N) y[idx] = 2.0*x[idx] + y[idx];
}

__global__
void kernel_b(float * x, float * y){
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx < N) y[idx] = 2.0*x[idx] + y[idx];
}

__global__
void kernel_c(float * x, float * y){
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx < N) y[idx] = 2.0*x[idx] + y[idx];
}

__global__
void kernel_d(float * x, float * y){
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx < N) y[idx] = 2.0*x[idx] + y[idx];
}

int main(){

// Set up and create events


cudaEvent_t event1;
cudaEvent_t event2;

cudaEventCreateWithFlags(&event1, cudaEventDisableTiming);
cudaEventCreateWithFlags(&event2, cudaEventDisableTiming);

// Set up and create streams


const int num_streams = 2;
cudaStream_t streams[num_streams];

for (int i = 0; i < num_streams; ++i){


cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
}

// Set up and initialize host data


float* h_x;
float* h_y;

h_x = (float*) malloc(N * sizeof(float));


h_y = (float*) malloc(N * sizeof(float));

for (int i = 0; i < N; ++i){


h_x[i] = (float)i;
h_y[i] = (float)i;
// printf("%2.0f ", h_x[i]);
}
printf("\n");

// Set up device data


float* d_x;
float* d_y;

cudaMalloc((void**) &d_x, N * sizeof(float));


cudaMalloc((void**) &d_y, N * sizeof(float));
cudaCheckErrors("cudaMalloc failed");

cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy failed");

// Set up graph
bool graphCreated=false;
cudaGraph_t graph;
cudaGraphExec_t instance;

cudaGraphCreate(&graph, 0);

int threads = 512;


int blocks = (N + threads - 1) / threads;

// Launching work
for (int i = 0; i < 100; ++i){
if (graphCreated == false){
// If first pass, starting stream capture
cudaStreamBeginCapture(streams[0],
cudaStreamCaptureModeGlobal);
cudaCheckErrors("Stream begin capture failed");

kernel_a<<<blocks, threads, 0, streams[0]>>>(d_x, d_y);


cudaCheckErrors("Kernel a failed");

cudaEventRecord(event1, streams[0]);
cudaCheckErrors("Event record failed");

kernel_b<<<blocks, threads, 0, streams[0]>>>(d_x, d_y);


cudaCheckErrors("Kernel b failed");

cudaStreamWaitEvent(streams[1], event1);
cudaCheckErrors("Event wait failed");

kernel_c<<<blocks, threads, 0, streams[1]>>>(d_x, d_y);


cudaCheckErrors("Kernel c failed");

cudaEventRecord(event2, streams[1]);
cudaCheckErrors("Event record failed");

cudaStreamWaitEvent(streams[0], event2);
cudaCheckErrors("Event wait failed");
kernel_d<<<blocks, threads, 0, streams[0]>>>(d_x, d_y);
cudaCheckErrors("Kernel d failed");

cudaStreamEndCapture(streams[0], &graph);
cudaCheckErrors("Stream end capture failed");

// Creating the graph instance


cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaCheckErrors("instantiating graph failed");

graphCreated = true;
}
// Launch the graph instance
cudaGraphLaunch(instance, streams[0]);
cudaCheckErrors("Launching graph failed");
cudaStreamSynchronize(streams[0]);
}

// Count how many nodes we had


cudaGraphNode_t *nodes = NULL;
size_t numNodes = 0;
cudaGraphGetNodes(graph, nodes, &numNodes);
cudaCheckErrors("Graph get nodes failed");
printf("Number of the nodes in the graph = %zu\n", numNodes);

// Below is for timing


cudaDeviceSynchronize();

using namespace std::chrono;

high_resolution_clock::time_point t1 = high_resolution_clock::now();

for (int i = 0; i < 1000; ++i){


cudaGraphLaunch(instance, streams[0]);
cudaCheckErrors("Launching graph failed");
//cudaStreamSynchronize(streams[0]);
}

cudaDeviceSynchronize();
high_resolution_clock::time_point t2 = high_resolution_clock::now();

duration<double> total_time = duration_cast<duration<double>>(t2 - t1);

std::cout << "Time " << total_time.count() << " s" << std::endl;

// Copy data back to host


cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("Finishing memcpy failed");

cudaDeviceSynchronize();

// Print out the first 25 values of h_y


for (int i = 0; i < 25; ++i){
printf("%2.0f ", h_y[i]);
}
printf("\n");

return 0;
}
__global__ void transposeCoalesced(float *odata, const float *idata)
{
__shared__ float tile[TILE_DIM][TILE_DIM];

int x = blockIdx.x * TILE_DIM + threadIdx.x;


int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)


tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

__syncthreads();

x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset


y = blockIdx.x * TILE_DIM + threadIdx.y;

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)


odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
