Assignment III - Advanced CUDA
Due date: 25 Nov 2020 at 23:59
Points: 1
Submission: a file upload
File types: pdf
Available: 9 Nov 2020 at 9:00 to 12 Jan 2021 at 23:59
In this assignment, we go through the GPU memory hierarchy, optimization techniques, and the
CUDA libraries provided by NVIDIA. Some exercises rely heavily on the NVIDIA profiler, nvprof,
for timing; see the nvprof documentation for details.
To submit your assignment, prepare a small report that answers the questions in the exercises.
Submit the report as a PDF with the following filename:
appgpu20_HW3_GroupNumberFromCanvas.pdf
Submit your code in a Git repository, and make it public so we can access it. Use the following folder
structure and include the link in your report:
Assignment_3/ex_ExerciseNumber/your_source_code_files
The assignment is solved and submitted in groups of one or two, according to the Canvas
signup.
Comment out the calls to the CPU versions during development; otherwise you may
still get a correct output file even if the kernel does not execute.
Run the program from the root of the exercise folder. Do not modify the folder names or the
folder structure!
Obtain the skeleton code, then compile and run it as follows, assuming that you are on Tegner with
all modules loaded and have already been allocated a K420 GPU. Modify the architecture flag if you
are using another GPU (e.g. -arch=sm_50 for the workstations in the lab rooms and -arch=sm_37 for the K80).
$ cd DD2360-HT19/Assignment_3/ex_1
https://canvas.kth.se/courses/20917/assignments/135985?module_item_id=270560 1/12
31/10/22, 16:21 Assignment III: Advanced CUDA
If you are running the exercise on Tegner, we also recommend that you add the -Y flag when you ssh
to enable X11 forwarding, so that you can display the images directly from the terminal.
The Bitmap (BMP) image format is an uncompressed format. Each BMP file contains an encoded
header that specifies the {width, height} of the image, the number of bits per color plane, and
more. After the header follows a string of interleaved color values (e.g., in BGR). Here is
a simplified example of what a 3x3 image looks like inside the file:
Each BGR triplet (Blue / Green / Red) represents one pixel of the image, with one 8-bit value per
channel that encodes the intensity of that channel. For 24bpp BMP, each value ranges from 0 to 255,
where 0 means the color is absent and 255 means it is at full intensity. The decoding of the image
is already done for you. The resulting data follows the exact mapping above, flattened as a
1D array.
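The flattened layout can be illustrated with a small host-side sketch in plain C++. It assumes row-major order with three interleaved bytes per pixel and no padding between rows; the struct `Bgr` and the helper `pixel_at` are hypothetical names and are not part of the skeleton code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: one decoded 24bpp pixel.
struct Bgr {
    std::uint8_t b, g, r;
};

// Read pixel (x, y) from a flattened, interleaved BGR buffer.
// Assumption: row-major order, 3 bytes per pixel, no row padding.
inline Bgr pixel_at(const std::vector<std::uint8_t>& data, int width, int x, int y)
{
    // Skip y full rows, then x pixels, each pixel being 3 bytes wide.
    const std::size_t offset = (static_cast<std::size_t>(y) * width + x) * 3;
    return Bgr{data[offset], data[offset + 1], data[offset + 2]};
}
```

For a 3x3 image, the pixel in column 1 of row 2 starts at byte (2 * 3 + 1) * 3 = 21 of the buffer.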
Grayscale conversion
The first step of our edge detector is to discard the color information and work directly in black &
white. Since a Bitmap image uses a BGR color space, where the combination of the individual
color intensities determines the final appearance of each pixel, we combine the three channel
values of each pixel to generate a grayscale BMP 8bpp image. In other words, we want only 8 bits per pixel.
images/hw3_result_1.bmp
You will get a new window that displays the converted image in black & white, such as this:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig2.jpeg)
TODO: Find the declaration of gpu_greyscale() in hw3_ex1.cu and implement the GPU version of
the black & white color conversion filter. The source code is already set up to call the kernel and
generate the output, but you will need to uncomment the code inside main().
Hint #1: The kernel is launched with a 2D grid of 2D blocks. Consider calculating the ID of the
thread in the Y direction to select the specific row, and the ID of the thread in the X direction to
select the specific column.
Hint #2: The boundaries of the image must not be exceeded. Include an if-statement, based on the
width and the height parameters, to prevent out-of-bounds accesses.
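The two hints can be tried out on the host before writing the kernel. The sketch below is plain C++, not CUDA: it loops over blocks and threads to emulate a 2D grid of 2D blocks, computes the column and row the way a thread would from its block and thread indices, and skips positions outside the image. The function name `grayscale_emulated` is hypothetical, and the channel weights are the common Rec. 709 luma coefficients, which is an assumption; use whatever the provided CPU version uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side emulation of a 2D grid of square 2D blocks (not CUDA code).
// Assumption: the grayscale weights are the Rec. 709 luma coefficients.
std::vector<float> grayscale_emulated(const std::vector<std::uint8_t>& bgr,
                                      int width, int height, int block_dim)
{
    std::vector<float> gray(static_cast<std::size_t>(width) * height, 0.0f);

    // Round up so that partially filled blocks still cover the image edges.
    const int grid_x = (width + block_dim - 1) / block_dim;
    const int grid_y = (height + block_dim - 1) / block_dim;

    for (int by = 0; by < grid_y; ++by)
    for (int bx = 0; bx < grid_x; ++bx)
    for (int ty = 0; ty < block_dim; ++ty)
    for (int tx = 0; tx < block_dim; ++tx) {
        const int x = bx * block_dim + tx;  // column, from the X indices
        const int y = by * block_dim + ty;  // row, from the Y indices
        if (x >= width || y >= height) continue;  // thread falls outside the image

        const std::size_t i = static_cast<std::size_t>(y) * width + x;
        gray[i] = 0.0722f * bgr[3 * i]        // blue
                + 0.7152f * bgr[3 * i + 1]    // green
                + 0.2126f * bgr[3 * i + 2];   // red
    }
    return gray;
}
```

With a 5x3 image and block_dim = 4, the emulated grid is 2x1 blocks, and the bounds check discards the threads that fall outside the image.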
Convolution Filtering
The second step is to implement a Gaussian filter to smooth the grayscale image that was generated
by the previous kernel. This filter reduces noise and thereby improves the input for
the next step, the Sobel filter, which is very sensitive to noise. For this exercise, we apply
a Gaussian filter using a 3×3 convolution matrix to all the pixels of the image. Convolution here
means replacing each pixel by the sum of the pixel and its local neighbors, each weighted by the corresponding matrix value:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/conv.png)
The * operator represents convolution, not matrix multiplication. Treat each pixel as the center of
the 3×3 convolution matrix and apply the weights to it and its surrounding pixels.
As we use symmetric filters, the order can be top-bottom as well.
Once the kernel is implemented and executed, a file images/hw3_result_2.bmp will be generated.
The differences are very fine. To see them, do:
images/hw3_result_1.bmp images/hw3_result_2.bmp \
images/hw3_result_2_comp.jpg
$ display images/hw3_result_2_comp.jpg
The new window will display a cropped area of the original black & white image (left), and a cropped
area of the new blurred image (right). The differences are subtle, but you should be able to
notice them:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig3.jpeg)
TODO: Find the implementation of cpu_applyFilter() inside the hw3_ex1.cu file and try to
understand how a given convolution matrix is applied to a certain pixel.
Hint #1: The input block of the image is given by the top-left corner, not the center of the block
(the target pixel).
Hint #2: This is not a matrix-matrix multiplication, keep this in mind while reviewing the source
code.
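As a reading aid for the two hints, here is a minimal host-side sketch in plain C++ of applying a 3×3 matrix to a block addressed by its top-left corner. The names `apply_filter_3x3` and `kGaussian` are hypothetical, and the weights are the commonly used 3×3 Gaussian approximation, assumed here rather than taken from the skeleton; the actual cpu_applyFilter() may differ in signature and details.

```cpp
#include <cstddef>
#include <vector>

// Weighted sum of a 3x3 block of `image`, where (x, y) is the TOP-LEFT corner
// of the block, so the target pixel is at (x + 1, y + 1). This is an
// element-wise multiply-and-accumulate, not a matrix-matrix product.
float apply_filter_3x3(const std::vector<float>& image, int width,
                       int x, int y, const float* filter)
{
    float sum = 0.0f;
    for (int row = 0; row < 3; ++row) {
        for (int col = 0; col < 3; ++col) {
            const std::size_t i = static_cast<std::size_t>(y + row) * width + (x + col);
            sum += image[i] * filter[row * 3 + col];
        }
    }
    return sum;
}

// Assumption: a common 3x3 Gaussian approximation; the weights sum to 1.
constexpr float kGaussian[9] = {
    1.0f / 16, 2.0f / 16, 1.0f / 16,
    2.0f / 16, 4.0f / 16, 2.0f / 16,
    1.0f / 16, 2.0f / 16, 1.0f / 16,
};
```

Because the weights sum to 1, a uniform region keeps its value, while an isolated bright pixel is spread over its neighbors.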
Finally, we complete the edge detector by applying the Sobel filter. With this filter, we are going to
compute an approximation of the gradient of the image intensity function. This allows us to create a
new image where the edges are emphasized, which constitutes the base for full edge detection
algorithms such as Canny (https://en.wikipedia.org/wiki/Canny_edge_detector) .
The filter uses two 3×3 kernels which are convolved with the original image to calculate
approximations of the derivatives in the horizontal and vertical directions. In other words, if we define
A as the source image, and Gx and Gy as two convolution matrices that generate the horizontal and
vertical derivative approximations, the computations are as follows:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/matrix.png)
The gradient magnitude of each pixel is obtained by taking the square root of the sum of the squares of these two approximations:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/gradient.png)
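The two linked figures are not reproduced in this text. Assuming they follow the commonly used Sobel definition, with G_x and G_y denoting the results of the two convolutions, the computation can be written as below. Sign conventions for the two matrices differ between references; this does not affect the gradient magnitude.

```latex
G_x =
\begin{bmatrix}
-1 & 0 & +1 \\
-2 & 0 & +2 \\
-1 & 0 & +1
\end{bmatrix} * A,
\qquad
G_y =
\begin{bmatrix}
-1 & -2 & -1 \\
 0 &  0 &  0 \\
+1 & +2 & +1
\end{bmatrix} * A,
\qquad
G = \sqrt{G_x^{2} + G_y^{2}}
```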
For the last exercise, we want you to implement the GPU version of cpu_sobel(), which is already
declared in hw3_ex1.cu under the name gpu_sobel(). The implementation of this function is very
similar to gpu_gaussian(), except that we apply two different convolution filters to the
same pixel and combine the results.
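A minimal host-side sketch of "two filters, one combined result" in plain C++ follows. The function name `sobel_at` is hypothetical, the block is addressed by its top-left corner as in the Gaussian exercise, and the matrix values are the standard Sobel kernels, which are assumed to match the ones in the skeleton.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient magnitude for the 3x3 block whose top-left corner is (x, y).
// Assumption: standard Sobel kernels; the skeleton's values may be ordered
// or signed differently, which does not change the magnitude.
float sobel_at(const std::vector<float>& image, int width, int x, int y)
{
    static const float kx[9] = {-1, 0, 1,  -2, 0, 2,  -1, 0, 1};
    static const float ky[9] = {-1, -2, -1,  0, 0, 0,  1, 2, 1};

    float gx = 0.0f;  // horizontal derivative approximation
    float gy = 0.0f;  // vertical derivative approximation
    for (int row = 0; row < 3; ++row) {
        for (int col = 0; col < 3; ++col) {
            const float v = image[static_cast<std::size_t>(y + row) * width + (x + col)];
            gx += v * kx[row * 3 + col];
            gy += v * ky[row * 3 + col];
        }
    }
    // Combine both results into a single gradient magnitude.
    return std::sqrt(gx * gx + gy * gy);
}
```

A uniform block yields 0, while a block whose right column is 1 and whose other values are 0 yields a magnitude of 4.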
Once the implementation is complete, run the program and open the result with the following two
commands:
images/hw3.bmp images/hw3_result_1.bmp \
images/hw3_result_3.bmp images/hw3_result_3_comp.jpg
$ display images/hw3_result_3_comp.jpg
A new window will open that displays the original image (left), the black & white image (center), and
finally the result of applying the Gaussian and Sobel filters (right):
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig4.jpeg)
You can also see what the resulting image looks like at a larger resolution. Use the display
command in combination with the resize flag:
Alternatively, you can remove the "-resize 1280x720" option to view the image at full
resolution. This might take some time to load, but it can be worth it to see all the small details.
Whether you resize the image or not, you should observe something like the following:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig5.jpeg)
In this section, we try to optimize the GPU versions of the Gaussian and Sobel filters by
using Shared Memory. The idea is to bring the content of the image from Global Memory
to Shared Memory in blocks of size BLOCK_SIZE_SH. This constant equals the dimension of each
block inside the grid, plus some additional elements in X and Y.
We ask you first to declare the BLOCK_SIZE_SH constant at the top of the file, which defines the
dimension of the Shared Memory block. Use the following:
#define BLOCK_SIZE_SH 18
We explain below why we use 18 here and not 16, the number of threads per block in each
dimension.
We will use this constant for the declaration of the memory space inside gpu_gaussian() and
gpu_sobel(). The declaration goes in the first, or one of the first, lines of each kernel:
This declares a 2D block in Shared Memory, using the 1D array representation that we
have already discussed in the previous exercises. The __shared__ attribute in the
declaration tells the compiler that we want this variable to be located in Shared Memory and
not in Local or Global Memory.
Hence, the first exercise would be to declare the shared block inside gpu_gaussian() and
gpu_sobel() . Then, we ask you to make each thread copy a pixel from the input image into the
shared memory block. You have to call __syncthreads() to guarantee that each thread has finished
retrieving its part of the block before using the data. Thereafter, change the input of the
applyFilter() function to use the shared block instead.
TODO: In hw3_ex1.cu , declare a Shared Memory block within gpu_gaussian() and another one
within gpu_sobel() . Thereafter, introduce the necessary changes to make each thread bring one
pixel value to the shared block. Change the input parameter of applyFilter() to use the shared block
(i.e., instead of a reference to the input image directly).
Hint #1: Use __syncthreads() to guarantee that all the threads have copied their pixels to the
Shared Memory.
If you have implemented it "correctly", you will observe that the output result is not exactly what you
expected it to be. You should see by now something like this, in the case of the Gaussian filter and
the side-by-side comparison with the original image:
(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig6.jpeg)
The reason is that the exercise is a little more complex than one might initially expect. With the
change that you just introduced, we are not considering that we also have to bring in extra columns
and rows on one of the sides of the block. Without them, some of the threads access
uninitialized data.
This is the main reason why we declared the constant BLOCK_SIZE_SH with two additional elements
per dimension. This will make sure that all the threads within the block access data that is available
inside the Shared Memory space. As such, the final exercise for you would be to consider the
boundaries of each thread block. We already gave you a hint in the declaration of the constant
BLOCK_SIZE_SH (i.e., two extra columns and rows are needed).
TODO: Extend the Shared Memory version of gpu_gaussian() and gpu_sobel() to transfer part
of the surrounding pixels of the thread block to Shared Memory. Make sure that you do not exceed
the boundaries of the image.
Hint #1: Once again, use __syncthreads() to guarantee that all the threads have copied their
pixels to the Shared Memory.
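The index arithmetic for the extra rows and columns can be checked on the host before touching the kernel. The sketch below is plain C++ that emulates the threads of one 16×16 block filling an 18×18 tile. It assumes the filter input is addressed by its top-left corner, so the two extra columns and rows lie to the right of and below the block. The names `kBlock`, `kTile` and `load_tile` are hypothetical, and this is only one possible way to distribute the extra loads among the threads.

```cpp
#include <cstddef>
#include <vector>

constexpr int kBlock = 16;          // threads per block dimension (assumed)
constexpr int kTile  = kBlock + 2;  // plays the role of BLOCK_SIZE_SH

// Host-side emulation of how the threads of block (bx, by) could fill the tile.
// Positions outside the image are left at 0.
std::vector<float> load_tile(const std::vector<float>& image,
                             int width, int height, int bx, int by)
{
    std::vector<float> tile(kTile * kTile, 0.0f);

    // Copy one image pixel into tile position (tile_x, tile_y), if it exists.
    auto copy = [&](int tile_x, int tile_y) {
        const int x = bx * kBlock + tile_x;
        const int y = by * kBlock + tile_y;
        if (x < width && y < height)
            tile[tile_y * kTile + tile_x] =
                image[static_cast<std::size_t>(y) * width + x];
    };

    for (int ty = 0; ty < kBlock; ++ty) {
        for (int tx = 0; tx < kBlock; ++tx) {
            copy(tx, ty);                                          // own pixel
            if (tx < 2) copy(tx + kBlock, ty);                     // two extra columns
            if (ty < 2) copy(tx, ty + kBlock);                     // two extra rows
            if (tx < 2 && ty < 2) copy(tx + kBlock, ty + kBlock);  // corner
        }
    }
    // In a kernel, __syncthreads() would go here, before any thread reads the tile.
    return tile;
}
```

For a block in the interior of the image, all 18 × 18 tile entries are filled; for a block at the right or bottom edge, the bounds check leaves the missing entries untouched.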
1. Explain how the mapping of GPU thread and thread blocks (which is already implemented for you
in the code) is working.
2. Explain why shared memory can (theoretically) improve performance.
3. Explain why the resulting image looks like a "grid" when the kernel simply copies one pixel per
thread to the shared block. Explain how this is solved and which cases need to be handled.
4. There are several images of different sizes in the image folder. Try running the program on them
and report how their execution time relates to file sizes.
DD2360-HT19/Assignment_3/ex_2
In our particle simulator, once the initial particle data is copied to the GPU after initialization, the
simulation is completely offloaded and time stepped on the GPU. In reality, an application often
includes steps that require processing on the host. This dependency implies that data has to be
transferred back and forth every time step. Implement the following:
2. All the particles are copied back to the host after the kernel completes, before proceeding to
the next time step.
2. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory.
3. Change the appropriate memory allocator to use cudaMallocHost() .
4. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory. Also, note the time spent on allocation.
Assignment_3/ex_2
1. What are the differences between pageable memory and pinned memory? What are the tradeoffs?
2. Do you see any difference in the breakdown of execution time after changing from pageable
memory to pinned memory?
3. What is managed memory? What are the implications of using managed memory?
4. If you are using Tegner or lab computers, the use of managed memory will result in an implicit
memory copy before the CUDA kernel launch. Why is that?
Assignment_3/ex_3
1. Change the code you implemented for exercise 2 to divide the particles into batches of a
given size.
2. Create CUDA streams (1 stream first, then 2 and 4 streams) to asynchronously copy particle
batches to the GPU, update the particle values on the GPU, and copy these values back to host memory.
3. Compare the performance with and without particle batches (the performance of exercise 2 with
pinned memory versus the performance of this exercise), varying the batch size. If possible,
use nvprof to collect traces and the NVIDIA Visual Profiler (nvvp) to visualize the overlap of
communication and computation. To use nvvp, you can check Tutorial: NVVP - Visualize
nvprof Traces.
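The bookkeeping for step 1 can be sketched on the host in plain C++. The example below splits a particle count into batches and assigns each batch to a stream index in round-robin order. The names `Batch` and `make_batches` are hypothetical, and a positive batch size is assumed. In the CUDA version, each batch's offset and count would be used for the asynchronous copies (for example with cudaMemcpyAsync) and the kernel launch in the corresponding stream.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one particle batch.
struct Batch {
    int offset;  // index of the first particle in the batch
    int count;   // number of particles in the batch
    int stream;  // index of the stream that processes the batch
};

// Split num_particles into batches of at most batch_size particles and
// assign them to num_streams streams in round-robin order.
// Assumption: batch_size > 0 and num_streams > 0.
std::vector<Batch> make_batches(int num_particles, int batch_size, int num_streams)
{
    std::vector<Batch> batches;
    int index = 0;
    for (int offset = 0; offset < num_particles; offset += batch_size, ++index) {
        // The last batch may be smaller than batch_size.
        const int count = std::min(batch_size, num_particles - offset);
        batches.push_back({offset, count, index % num_streams});
    }
    return batches;
}
```

For 10 particles, a batch size of 4 and 2 streams, this yields three batches: particles 0 to 3 on stream 0, 4 to 7 on stream 1, and 8 to 9 on stream 0.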
1. What are the advantages of using CUDA streams and asynchronous memory copies?
2. What is the performance improvement (if any) in using more than one CUDA stream?
3. What is the impact of batch size on the performance?
Assignment_3/ex_bonus
DD2360-HT19/Assignment_3/ex_3/exercise_3.cu
For simplicity, we only consider matrices whose width is a multiple of 16. The number can be
changed by defining TILE_SIZE. We have also implemented timing for you. Select the appropriate
architecture and compile the code as follows:
While developing the sub-exercises, you can comment out some parts of the code in main() for
testing. The code can be executed with -s [matrix size], and optionally -v to perform CPU
verification. For example, to test a 1024x1024 matrix with CPU verification:
GPU cuBLAS matmul: 1.233000 ms
GEMM
Study the function cpu_matmul() and the GPU kernel naive_sgemm_kernel(). Understand how the
CPU version is translated to the GPU version. Also, understand how the threads and thread blocks
are organized.
Just for fun: for those who took the course DD2356, do you remember why the two inner loops in cpu_matmul() are reordered?
Study the code in the GPU kernel shared_sgemm_kernel(). The code performs tiled matrix
multiplication. The algorithm of tiled matrix multiplication is the same as the ordinary matrix
multiplication algorithm, except that the lowest unit of multiplication is a sub-matrix instead of a scalar. More
information can be found here (http://www.cs.utexas.edu/users/rvdg/LinAlgBook/Section2-5.pdf) .
Fill in the parts of the function that are marked with a TODO comment. Ensure CPU verification
passes.
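To make the idea of multiplying sub-matrices concrete, here is a host-side sketch of tiled matrix multiplication in plain C++. It is not the kernel from the skeleton: the function name `tiled_matmul` is hypothetical, the matrices are assumed to be square and stored in row-major order, and there is no shared memory involved. The three outer loops select a pair of tiles, and the three inner loops multiply them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B for square n x n row-major matrices, processed tile by tile.
// Assumption: C is zero-initialized by the caller if a plain product is wanted.
void tiled_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, int tile)
{
    for (int i0 = 0; i0 < n; i0 += tile)
    for (int j0 = 0; j0 < n; j0 += tile)
    for (int k0 = 0; k0 < n; k0 += tile) {
        // Multiply the (i0, k0) tile of A with the (k0, j0) tile of B and
        // accumulate into the (i0, j0) tile of C.
        const int i_end = std::min(i0 + tile, n);
        const int j_end = std::min(j0 + tile, n);
        const int k_end = std::min(k0 + tile, n);
        for (int i = i0; i < i_end; ++i) {
            for (int k = k0; k < k_end; ++k) {
                const float a = A[static_cast<std::size_t>(i) * n + k];
                for (int j = j0; j < j_end; ++j)
                    C[static_cast<std::size_t>(i) * n + j] +=
                        a * B[static_cast<std::size_t>(k) * n + j];
            }
        }
    }
}
```

In the GPU version, one thread block typically handles one tile of C and stages the current tiles of A and B in shared memory, so that each loaded element is reused by several threads.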
Study the code in the function cublas_sgemm(). The function call to cuBLAS SGEMM is already
coded for you. Fill in the blanks. Hint: since all the matrices are square, all the widths and strides are
the same. Ensure CPU verification passes.
5. The execution time benchmark implemented in the code is good enough for this
exercise, but in general it is not a good way to do a benchmark. Why?