
Assignment III: Advanced CUDA

Due date
25 Nov 2020 by 23:59
Points
1
Submitting
a file upload
File types
pdf
Available
9 Nov 2020 at 9:00 to 12 Jan 2021 at 23:59

This assignment was locked on 12 Jan 2021 at 23:59.

In this assignment, we will go through the concepts of the GPU memory hierarchy, optimization
techniques, and CUDA libraries provided by NVIDIA. Some exercises rely heavily on the NVIDIA
Profiler for timing; read more about nvprof here.

To submit your assignment, prepare a small report that answers the questions in the exercises.
Submit the report as a PDF with the following filename:

appgpu20_HW3_GroupNumberFromCanvas.pdf

Submit your code in a Git repository, and make it public so we can access it. Use the following folder
structure and include the link in your report:

Assignment_3/ex_ExerciseNumber/your_source_code_files

The assignment is solved and submitted individually or in a group of two, according to the Canvas
signup.

Exercise 1 - CUDA Edge Detector using shared memory


In this exercise, we will implement an edge detector for bitmap images. We will also provide a
reference image and reference output for you to check the answer.

Comment out the function calls to the CPU versions during development; otherwise, you may
still get a correct output file even if the kernel does not execute.
Run the program from the root of the exercise folder; do not modify the folder names or
folder structure!

Obtain the skeleton code, then compile and run as follows, assuming that you are on Tegner with all
modules loaded and have already been allocated a K420 GPU. Modify the architecture flag if you
are using other GPUs (i.e. -arch=sm_50 for workstations in lab rooms and -arch=sm_37 for K80).

$ module load git

$ git clone https://github.com/steven-chien/DD2360-HT19.git

$ cd DD2360-HT19/Assignment_3/ex_1

$ nvcc -O3 -arch=sm_30 hw3_ex1.cu -o hw3_ex1.out

$ srun -n 1 ./hw3_ex1.out images/hw3.bmp


If you are running the exercise on Tegner, we also recommend that you add the -Y flag when you ssh
to enable X11 forwarding, so that you can display the images directly through the terminal.

Do not change the folder structure or rename anything.

Mapping of input data

The Bitmap (BMP) image format is an uncompressed format. Each BMP file contains an encoded
header that specifies the {width, height} of the image, the number of bits per color plane, and
more. After the header, a string of interleaved color values follows (e.g., in BGR). Here is
a simplified example of what a 3x3 image looks like inside the file:

Each BGR triplet, from Blue / Green / Red, holds an 8-bit value per channel that encodes the
intensity of that channel. The values span from 0 to 255 in the case of BMP 24bpp, where 0 means
the absence of that color and 255 its full intensity. The decoding of the image
is already done for you. The resulting data follows the exact mapping as above, flattened as a
1D array.
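
For illustration only: assuming the decoded image is exposed as an array of bytes (the name data
and the variables below are hypothetical, not the skeleton's), the three channels of the pixel at
column x and row y of a width-wide image are found at:

int offset = (y * width + x) * 3;        // 3 channel values per 24bpp pixel
unsigned char blue  = data[offset];      // B comes first in BGR
unsigned char green = data[offset + 1];
unsigned char red   = data[offset + 2];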

Grayscale conversion

The first step of our edge detector is to discard the color information and work directly in black &
white. A Bitmap image uses a BGR color space, where the combination of the individual channel
intensities gives the final intensity of each pixel, so we combine these channels to generate a
BMP 8bpp image in grayscale. In other words, we want only 8 bits per pixel.

For the conversion to grayscale, we are going to use the Colorimetric
(https://en.wikipedia.org/wiki/Grayscale) (luminance-preserving) method, a weighted sum of the
three color values (illustrated in
https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/yuv.png):

Y = 0.2126 * R + 0.7152 * G + 0.0722 * B

Note that since the color information is discarded, each pixel can now be represented by one value
instead of three, meaning the output array will have size width*height. If the program is executed
successfully, an image will be created:

images/hw3_result_1.bmp

The image can be viewed with



$ display -resize 1280x720 images/hw3_result_1.bmp

You will get a new window that displays the converted image in black & white, such as this:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig2.jpeg)

Do not display the image without resizing since it is very large.

TODO: Find the declaration of gpu_greyscale() in hw3_ex1.cu and implement the GPU version of
the black & white color conversion filter. The source code is already set up to call the kernel and
generate the output, but you will need to uncomment the code inside main().

Hint #1: The kernel is launched with a 2D grid of 2D blocks. Consider calculating the ID of the
thread in the Y direction to select the specific row, and the ID of the thread in the X direction to
select the specific column.
Hint #2: The boundaries of the image cannot be exceeded. You must include an if-statement to
prevent any issues, based on the width and the height parameters.
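
To make the hints concrete, here is a minimal sketch of what the kernel body could look like,
assuming the skeleton passes the image as flattened float arrays and using the colorimetric
weights from above (parameter names are illustrative; check the actual declaration in the skeleton):

__global__ void gpu_greyscale(int width, int height, float *image, float *image_out)
{
    int idx_x = blockIdx.x * blockDim.x + threadIdx.x;  // column (Hint #1)
    int idx_y = blockIdx.y * blockDim.y + threadIdx.y;  // row (Hint #1)

    if (idx_x < width && idx_y < height)                // boundary check (Hint #2)
    {
        int offset_out = idx_y * width + idx_x;         // one value per output pixel
        int offset     = offset_out * 3;                // three values (BGR) per input pixel

        image_out[offset_out] = 0.0722f * image[offset]      // B
                              + 0.7152f * image[offset + 1]  // G
                              + 0.2126f * image[offset + 2]; // R
    }
}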

Convolution Filtering

The second step is to implement a Gaussian filter to smooth the grayscale image generated by the
previous kernel. We implement this filter to reduce the noise and increase the quality of the input for
the next step, the Sobel filter, which is very sensitive to noise. For this exercise, we are going to apply
a Gaussian filter using a 3×3 convolution matrix on all the pixels of the image. The term
convolution refers to adding each pixel to its local neighbors, weighted by the matrix values:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/conv.png)


The * operator represents convolution, not matrix multiplication. Here, you have to treat each
pixel as the center of the 3×3 convolution matrix and apply the weights to the surrounding pixels.
As we use symmetric filters, the order can be top-to-bottom as well.

Once the kernel is implemented and executed, a file images/hw3_result_2.bmp will be generated.
The differences are very fine; to see them, do:

$ montage -tile 2x1 -crop 320x180+512+512 -geometry 640x360 \

images/hw3_result_1.bmp images/hw3_result_2.bmp \

images/hw3_result_2_comp.jpg

$ display images/hw3_result_2_comp.jpg

The new window will display a cropped area of the original black & white image (left), and a cropped
area of the new blurred image (right). The differences are very subtle, but you should be able to
notice some differences:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig3.jpeg)

TODO: Find the implementation of cpu_applyFilter() inside the hw3_ex1.cu file and try to
understand how a given convolution matrix is applied to a certain pixel.

Hint #1: The input block of the image is given by the top-left corner, not the center of the block
(the target pixel).
Hint #2: This is not a matrix-matrix multiplication; keep this in mind while reviewing the source
code.
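
To see the pattern in compact form, a helper that applies a filter_dim x filter_dim matrix to the
block whose top-left corner is at (sx, sy) could look roughly like the following sketch (the names
are assumptions, not necessarily the skeleton's exact code); the classic 3×3 Gaussian matrix used
for this kind of smoothing is 1/16 * {1 2 1; 2 4 2; 1 2 1}:

__device__ float applyFilter(float *image, int stride, float *matrix,
                             int filter_dim, int sx, int sy)
{
    float pixel = 0.0f;

    for (int h = 0; h < filter_dim; h++)
    {
        int offset   = (sy + h) * stride + sx;  // start of this row in the image block
        int offset_f = h * filter_dim;          // start of this row in the filter

        for (int w = 0; w < filter_dim; w++)
        {
            // weighted sum of the neighborhood, not a matrix-matrix product
            pixel += matrix[offset_f + w] * image[offset + w];
        }
    }

    return pixel;
}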

Detecting Edges in the Image

Finally, we complete the edge detector by applying the Sobel filter. With this filter, we are going to
compute an approximation of the gradient of the image intensity function. This allows us to create a
new image where the edges are emphasized, which constitutes the basis for full edge detection
algorithms such as Canny (https://en.wikipedia.org/wiki/Canny_edge_detector).

The filter uses two 3×3 kernels which are convolved with the original image to calculate
approximations of the derivatives in the horizontal and vertical directions. In other words, if we define
A as the source image, and Gx and Gy as two convolution matrices that generate the horizontal and
vertical derivative approximations, the computations are as follows:


(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/matrix.png)

The resultant gradient magnitude of the pixel is obtained by calculating the square root of the sum
of their squares, G = √(Gx² + Gy²):

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/gradient.png)

For the last exercise, we want you to implement the GPU version of cpu_sobel(), which is already
declared in hw3_ex1.cu under the name gpu_sobel(). The implementation of this function is very
similar to gpu_gaussian(), except that we apply two different convolution filters to the
same pixel and combine the results.
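
Schematically, and reusing the hypothetical applyFilter() helper sketched earlier, the per-pixel
work inside gpu_sobel() could look like this (sobel_x and sobel_y are assumed to hold the two
3×3 kernels):

// inside gpu_sobel(), after the usual index calculation and boundary check:
float gx = applyFilter(image, width, sobel_x, 3, idx_x, idx_y);  // horizontal derivative
float gy = applyFilter(image, width, sobel_y, 3, idx_x, idx_y);  // vertical derivative
image_out[idx_y * width + idx_x] = sqrtf(gx * gx + gy * gy);     // gradient magnitude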

Once the implementation is complete, run the program and open the result with the following two
commands:

$ montage -border 0 -geometry 640x360 -tile 3x1 \

images/hw3.bmp images/hw3_result_1.bmp \

images/hw3_result_3.bmp images/hw3_result_3_comp.jpg

$ display images/hw3_result_3_comp.jpg

A new window will open that displays the original image (left), the black & white image (center), and
finally the result of applying the Gaussian and Sobel filters (right):

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig4.jpeg)

You can also observe what the resulting image looks like at a larger resolution. Use the display
command in combination with the resize flag:

$ display -resize 1280x720 images/hw3_result_3.bmp

Alternatively, you can remove the " -resize 1280x720 " option to visualize the image at full
resolution. This might take some time to load, but it may be worth it to appreciate all the small
details. Whether you resize the image or not, you should observe something like the following:


(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig5.jpeg)

Optimizing Memory Accesses

In this section, we are going to optimize the GPU versions of the Gaussian and Sobel filters by
using Shared Memory. The idea is to bring the content of the image from Global Memory
to Shared Memory in blocks of size BLOCK_SIZE_SH. This constant is the dimension of each
block inside the grid, plus some additional values in X and Y.

We first ask you to declare the BLOCK_SIZE_SH constant at the top of the file, which defines the
dimension of the Shared Memory block. Use the following:

#define BLOCK_SIZE_SH 18

We will explain later why we use 18 here and not 16, the number of threads per block in each
dimension.

We will use this constant for the declaration of the memory space inside gpu_gaussian() and
gpu_sobel(). Place the declaration in one of the first lines of each kernel:

__shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

This declares a 2D shared block in Shared Memory, using the 1D array representation that we
have already discussed in the previous exercises. The __shared__ attribute in the declaration
tells the compiler that we want this variable to be located in Shared Memory and
not in Local or Global Memory.

Hence, the first exercise would be to declare the shared block inside gpu_gaussian() and
gpu_sobel(). Then, we ask you to make each thread copy a pixel from the input image into the
shared memory block. You have to call __syncthreads() to guarantee that each thread has finished
retrieving its part of the block before using the data. Thereafter, change the input of the
applyFilter() function to use the shared block instead.

TODO: In hw3_ex1.cu , declare a Shared Memory block within gpu_gaussian() and another one
within gpu_sobel() . Thereafter, introduce the necessary changes to make each thread bring one
pixel value to the shared block. Change the input parameter of applyFilter() to use the shared block
(i.e., instead of a reference to the input image directly).

Hint #1: Use __syncthreads() to guarantee that all the threads have copied their pixels to the
Shared Memory.

If you have implemented it "correctly", you will observe that the output result is not exactly what you
expected it to be. By now you should see something like this in the case of the Gaussian filter, in a
side-by-side comparison with the original image:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig6.jpeg)

The reason is that the exercise is a little more complex than one might initially expect. With the
change that you just introduced, we are not considering that we also have to bring in extra columns
and rows on the sides of the block. Without this change, some of the threads access
uninitialized data.

This is the main reason why we declared the constant BLOCK_SIZE_SH with two additional elements
per dimension. This will make sure that all the threads within the block access data that is available
inside the Shared Memory space. As such, the final exercise for you would be to consider the
boundaries of each thread block. We already gave you a hint in the declaration of the constant
BLOCK_SIZE_SH (i.e., two extra columns and rows are needed).

TODO: Extend the Shared Memory version of gpu_gaussian() and gpu_sobel() to transfer part
of the surrounding pixels of the thread block to Shared Memory. Make sure that you do not exceed
the boundaries of the image.

Hint #1: Once again, use __syncthreads() to guarantee that all the threads have copied their
pixels to the Shared Memory.
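
Putting the pieces together, the staging phase could look roughly like the following sketch. It
assumes 16×16 thread blocks and a one-pixel halo on every side of the shared block; the skeleton
may organize the halo differently, so treat this as an illustration rather than the expected solution:

__shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int tx = threadIdx.x + 1;  // +1: the interior starts after the halo column
int ty = threadIdx.y + 1;  // +1: the interior starts after the halo row

if (idx_x < width && idx_y < height)
{
    // each thread stages its own pixel ...
    sh_block[ty * BLOCK_SIZE_SH + tx] = image[idx_y * width + idx_x];

    // ... and edge threads additionally stage the halo (left edge shown here;
    // the right, top, bottom edges and the corners are analogous, always
    // checking the image boundaries):
    if (threadIdx.x == 0 && idx_x > 0)
        sh_block[ty * BLOCK_SIZE_SH] = image[idx_y * width + idx_x - 1];
}
__syncthreads();  // every pixel must be staged before any thread filters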

Questions to answer in the report


1. Explain how the mapping of GPU threads and thread blocks (which is already implemented for you
in the code) works.
2. Explain why shared memory can (theoretically) improve performance.
3. Explain why the resulting image looks like a "grid" when the kernel simply copies pixels to
the shared block. Explain how this is solved and what the cases are.
4. There are several images of different sizes in the image folder. Try running the program on them
and report how the execution time relates to the file sizes.

Exercise 2 - Pinned and Managed Memory


In this exercise, we study the use of pinned memory and managed memory. We will reuse your
solution from Assignment II - Exercise 3 (the particle mover) for this exercise.

Programming exercise on Pinned Memory

Put the file of this sub-exercise in:

DD2360-HT19/Assignment_3/ex_2

and call it exercise_2a.cu.

In our particle simulator, once the initial particle data is copied to the GPU after initialization, the
simulation is completely offloaded and time-stepped on the GPU. In reality, an application often
includes steps that require processing on the host. This dependency implies that data has to be
transferred back and forth every time step. Implement the following:

1. Modify the program, such that


1. All particles are copied to the GPU at the beginning of a time step.


2. All the particles are copied back to the host after the kernel completes, before proceeding to
the next time step.
2. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory.
3. Change the appropriate memory allocator to use cudaMallocHost() (a sketch follows this list).
4. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory. Also, note the time spent on allocation.
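
A minimal sketch of the per-time-step structure with pinned host memory; particle_t,
timestep_kernel and NUM_PARTICLES are assumptions carried over from the Assignment II solution:

particle_t *particles;
cudaMallocHost((void **)&particles, NUM_PARTICLES * sizeof(particle_t));  // pinned, instead of malloc()

for (int t = 0; t < NUM_ITERATIONS; t++)
{
    cudaMemcpy(d_particles, particles, NUM_PARTICLES * sizeof(particle_t),
               cudaMemcpyHostToDevice);                        // copy in at the start of the step
    timestep_kernel<<<grid, block>>>(d_particles, NUM_PARTICLES, t);
    cudaMemcpy(particles, d_particles, NUM_PARTICLES * sizeof(particle_t),
               cudaMemcpyDeviceToHost);                        // copy back before the next step
}

cudaFreeHost(particles);  // pinned allocations are released with cudaFreeHost()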

Programming exercise on Managed Memory

Put the file of this sub-exercise in:

Assignment_3/ex_2

and call it exercise_2b.cu.

Make the necessary changes to the program so that it uses managed memory (a sketch follows the list).

1. Change the GPU memory allocators to use cudaMallocManaged() .


2. Eliminate explicit data copy and device pointers.
3. Study the breakdown of timing using nvprof.
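
A minimal sketch of the managed-memory version, under the same naming assumptions as above;
note that a single pointer is now used on both the host and the device:

particle_t *particles;
cudaMallocManaged(&particles, NUM_PARTICLES * sizeof(particle_t));  // one pointer for host and device

init_particles(particles, NUM_PARTICLES);  // the host writes directly, no cudaMemcpy

for (int t = 0; t < NUM_ITERATIONS; t++)
{
    timestep_kernel<<<grid, block>>>(particles, NUM_PARTICLES, t);  // no explicit copies
    cudaDeviceSynchronize();  // make the results visible before the host touches them
}

cudaFree(particles);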

Questions to answer in the report

1. What are the differences between pageable memory and pinned memory? What are the tradeoffs?
2. Do you see any difference in the breakdown of execution time after changing from pageable
memory to pinned memory?
3. What is managed memory? What are the implications of using managed memory?
4. If you are using Tegner or lab computers, the use of managed memory will result in an implicit
memory copy before CUDA kernel launch. Why is that?

Exercise 3 - CUDA Streams / Asynchronous Copy - Particle Batching

In this exercise, we study the use of CUDA streams and asynchronous memory copies to overlap
communication and computation. We apply CUDA streams and asynchronous copies to the particle
mover: we will divide the particles into batches, copy a batch to GPU memory, move the particles in
the batch, and copy the updated particle values back to CPU memory. We will reuse the
pinned-memory solution of Exercise 2, as pinned memory is required to perform
asynchronous copies.

Put the file of this sub-exercise in:

Assignment_3/ex_3

Implement the following:


1. Change the code you implemented for Exercise 2 to divide the particles into batches of a
given size.
2. Create CUDA streams (1 stream first, then 2 and 4 streams) to asynchronously copy particle
batches to the GPU, update the particle values on the GPU, and copy these values back to CPU
memory (a sketch follows this list).
3. Compare the performance with and without particle batches (the performance of Exercise 2 with
pinned memory versus the performance of this exercise) while varying the batch size. If possible,
use nvprof to collect traces and the NVIDIA Visual Profiler (nvvp) to visualize the overlap of
communication and computation. To use nvvp, you can check Tutorial: NVVP - Visualize
nvprof Traces
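
A minimal sketch of the batching loop with asynchronous copies; NUM_STREAMS, BATCH_SIZE and the
other names are assumptions, and the particle array must live in pinned memory for the copies to
be truly asynchronous:

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++)
    cudaStreamCreate(&streams[i]);

for (int b = 0; b < NUM_PARTICLES / BATCH_SIZE; b++)
{
    int offset = b * BATCH_SIZE;
    cudaStream_t s = streams[b % NUM_STREAMS];  // round-robin over the streams

    cudaMemcpyAsync(&d_particles[offset], &particles[offset],
                    BATCH_SIZE * sizeof(particle_t), cudaMemcpyHostToDevice, s);
    timestep_kernel<<<grid, block, 0, s>>>(&d_particles[offset], BATCH_SIZE, t);
    cudaMemcpyAsync(&particles[offset], &d_particles[offset],
                    BATCH_SIZE * sizeof(particle_t), cudaMemcpyDeviceToHost, s);
}

for (int i = 0; i < NUM_STREAMS; i++)
    cudaStreamSynchronize(streams[i]);  // all batches complete before the next time step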

Questions to answer in the report

1. What are the advantages of using CUDA streams and asynchronous memory copies?
2. What is the performance improvement (if any) in using more than one CUDA stream?
3. What is the impact of batch size on the performance?

Bonus Exercise - CUDA Libraries - cuBLAS


In this exercise, we study one of the most widely used CUDA libraries, cuBLAS. The library is used
to accelerate BLAS (Basic Linear Algebra Subprograms) operations. The library has a host interface
and is highly optimized. We use SGEMM (Single-precision General Matrix Multiply) of square
matrices as a case study in this exercise.

Put the file of this sub-exercise in:

Assignment_3/ex_bonus

We provide you with skeleton code in:

DD2360-HT19/Assignment_3/ex_3/exercise_3.cu

For simplicity, we only consider matrices with widths that are a multiple of 16. The number can be
changed by defining TILE_SIZE. We have also implemented the timing for you. Select the appropriate
architecture and compile the code as follows:

$ nvcc -O3 -arch=sm_30 exercise_3.cu -o exercise_3.out -lcurand -lcublas

While developing the sub-exercises, you can comment out some parts of the code in main() for
testing. The code can be executed with -s [matrix size] and, optionally, -v to perform CPU
verification. For example, to test a 1024x1024 matrix with CPU verification:

$ srun -n 1 ./exercise_3.out -s 1024 -v

Matrix size: 1024x1024

Matrix size: 1024x1024

Grid size: 64x64

Tile size: 16x16

Run CPU sgemm: 1

CPU matmul: 3358.208000 ms

GPU cuBLAS matmul: 1.233000 ms

GPU matmul (global memory): 78.396000 ms

GPU matmul (shared memory): 7.610000 ms

GEMM

GEMM is defined as C = αAB + βC, where A and B are the matrices to be multiplied and C is the
matrix where the result is accumulated. α is a scalar that controls the scaling of the multiplication
result and β is another scalar that controls how the result is accumulated into C. In this exercise,
we only consider matrix multiplication. Therefore we set α = 1 and β = 0.

Naive matrix multiplication

Study the function cpu_matmul() and the GPU kernel naive_sgemm_kernel(). Understand how the
CPU version is translated to the GPU version. Also, understand how the threads and thread blocks
are organized.

Just for fun: for those who took the course DD2356, do you remember why the two inner loops in cpu_matmul() are reordered?

Matrix multiplication with shared memory

Study the code in the GPU kernel shared_sgemm_kernel(). The code performs tiled matrix
multiplication. The algorithm of tiled matrix multiplication is the same as the ordinary matrix
multiplication algorithm, except that the lowest unit of multiplication is sub-matrices instead of
scalars. More information can be found here
(https://www.cs.utexas.edu/users/rvdg/LinAlgBook/Section2-5.pdf).

Fill in the parts of the function that are marked with a TODO comment, as sketched below. Ensure
the CPU verification passes.
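
For orientation, the core of such a kernel typically looks like the following sketch, where n is the
matrix width and the variable names are illustrative rather than the skeleton's:

__shared__ float sA[TILE_SIZE][TILE_SIZE];
__shared__ float sB[TILE_SIZE][TILE_SIZE];

int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float acc = 0.0f;

for (int t = 0; t < n / TILE_SIZE; t++)
{
    // each thread stages one element of the current A tile and B tile
    sA[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE_SIZE + threadIdx.x)];
    sB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * n + col];
    __syncthreads();  // first barrier: the tiles must be fully staged before use

    for (int k = 0; k < TILE_SIZE; k++)
        acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];  // multiply the resident tiles
    __syncthreads();  // second barrier: the tiles must be consumed before overwriting
}

C[row * n + col] = acc;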

Matrix multiplication with cuBLAS

Study the code in the function cublas_sgemm(). The function call to cuBLAS SGEMM is already
coded for you; fill in the blanks, as sketched below. Hint: since all the matrices are square, all the
widths and strides are the same. Ensure the CPU verification passes.
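
For square n×n matrices with α = 1 and β = 0, the call typically takes the following shape (a
sketch, with d_A, d_B and d_C as device pointers; the swapped operand order relates to question 3
below):

float alpha = 1.0f, beta = 0.0f;
cublasHandle_t handle;
cublasCreate(&handle);

// cuBLAS assumes column-major storage; passing the row-major B and A in
// swapped order yields a row-major C = A * B without any transposition
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, n, n,
            &alpha, d_B, n, d_A, n,
            &beta,  d_C, n);

cublasDestroy(handle);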

Questions to answer in the report

1. Explain why the matrix size has to be a multiple of 16.


2. Refer to shared_sgemm_kernel(). There are two __syncthreads() in the loop. What are they used
for, in the context of this code?
1. What is the directive that can potentially improve performance in the actual multiplication?
What does it do?
2. There is a large speedup after switching from using global memory to shared memory,
compared to the Edge Detector in Exercise 1. What might be the reason?
3. Refer to cublas_sgemm(). We asked that you compute B × A instead of A × B. It has to
do with an important property of cuBLAS. What is that property, and why do we compute B × A?
4. Run the program with different input sizes, for example from 64, 128, ..., to 4096. Make a
grouped bar plot of the execution times of the different versions (CPU, GPU Global, GPU Shared,
GPU cuBLAS). You can plot the CPU results in a separate figure if their execution time goes out of
scale compared to the rest.

5. The way the execution-time benchmark is implemented in the code is good enough for this
exercise, but in general it is not a good way to benchmark. Why?
