
Assignment III: Advanced CUDA

Due date
25 Nov 2020 by 23:59
Points
1
Submitting
a file upload
File types
pdf
Available
9 Nov 2020 at 9:00 to 12 Jan 2021 at 23:59

This assignment was locked on 12 Jan 2021 at 23:59.

In this assignment, we will go through the concepts of the GPU memory hierarchy, optimization
techniques, and CUDA libraries provided by NVIDIA. Some exercises rely heavily on the NVIDIA
Profiler for timing; read more about nvprof here.

To submit your assignment, prepare a small report that answers the questions in the exercises.
Submit the report as a PDF with the following filename:

appgpu20_HW3_GroupNumberFromCanvas.pdf

Submit your code in a Git repository, and make it public so we can access it. Use the following folder
structure and include the link in your report:

Assignment_3/ex_ExerciseNumber/your_source_code_files

The assignment is solved and submitted individually or in a group of two, according to the Canvas
signup.

Exercise 1 - CUDA Edge Detector using shared memory


In this exercise, we will implement an edge detector for bitmap images. We will also provide a
reference image and reference output for you to check the answer.

Comment out the function calls to the CPU versions during development; otherwise, you may
still get a correct output file even if the kernel does not execute.
Run the program from the root of the exercise folder; do not modify the folder names or
folder structure!

Obtain the skeleton code, then compile and run as follows, assuming that you are on Tegner with all
modules loaded and have already been allocated a K420 GPU. Modify the architecture flag if you
are using other GPUs (i.e. -arch=sm_50 for workstations in lab rooms and -arch=sm_37 for K80).

$ module load git

$ git clone https://github.com/steven-chien/DD2360-HT19.git

$ cd DD2360-HT19/Assignment_3/ex_1

$ nvcc -O3 -arch=sm_30 hw3_ex1.cu -o hw3_ex1.out

$ srun -n 1 ./hw3_ex1.out images/hw3.bmp


If you are running the exercise on Tegner, we also recommend that you add the -Y flag when you ssh
to enable X11 forwarding, so that you can display the images directly through the terminal.

Do not change the folder structure or rename anything.

Mapping of input data

The Bitmap (BMP) image format is an uncompressed format. Each BMP file contains an encoded
header that specifies the {width, height} of the image, the number of bits per color plane, and
more. After the header, a string of interleaved color values follows (e.g., in BGR). Here is
a simplified example of what a 3x3 image looks like inside the file:

Each BGR triplet, from Blue / Green / Red, holds an 8-bit value per channel that encodes the
intensity of that channel. The values span from 0 to 255 in the case of BMP 24bpp, where 0 means
the absence of that color and 255 its full intensity. The decoding of the image
is already done for you. The resulting data follows the exact mapping as above, flattened as a
1D array.
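
For illustration only: assuming the decoded image is exposed as an array of bytes (the name data
and the variables below are hypothetical, not the skeleton's), the three channels of the pixel at
column x and row y of a width-wide image are found at:

int offset = (y * width + x) * 3;        // 3 channel values per 24bpp pixel
unsigned char blue  = data[offset];      // B comes first in BGR
unsigned char green = data[offset + 1];
unsigned char red   = data[offset + 2];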

Grayscale conversion

The first step of our edge detector is to discard the color information and work directly in black &
white. A Bitmap image uses a BGR color space, where the combination of the individual channel
intensities gives the final intensity of each pixel, so we combine these channels to generate a
BMP 8bpp image in grayscale. In other words, we want only 8 bits per pixel.

For the conversion to grayscale, we are going to use the Colorimetric
(https://en.wikipedia.org/wiki/Grayscale) (luminance-preserving) method, a weighted sum of the
three color values (illustrated in
https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/yuv.png):

Y = 0.2126 * R + 0.7152 * G + 0.0722 * B

Note that since the color information is discarded, each pixel can now be represented by one value
instead of three, meaning the output array will have size width*height. If the program is executed
successfully, an image will be created:

images/hw3_result_1.bmp

The image can be viewed with



$ display -resize 1280x720 images/hw3_result_1.bmp

You will get a new window that displays the converted image in black & white, such as this:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig2.jpeg)

Do not display the image without resizing since it is very large.

TODO: Find the declaration of gpu_greyscale() in hw3_ex1.cu and implement the GPU version of
the black & white color conversion filter. The source code is already set up to call the kernel and
generate the output, but you will need to uncomment the code inside main().

Hint #1: The kernel is launched with a 2D grid of 2D blocks. Consider calculating the ID of the
thread in the Y direction to select the specific row, and the ID of the thread in the X direction to
select the specific column.
Hint #2: The boundaries of the image cannot be exceeded. You must include an if-statement to
prevent any issues, based on the width and the height parameters.
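
To make the hints concrete, here is a minimal sketch of what the kernel body could look like,
assuming the skeleton passes the image as flattened float arrays and using the colorimetric
weights from above (parameter names are illustrative; check the actual declaration in the skeleton):

__global__ void gpu_greyscale(int width, int height, float *image, float *image_out)
{
    int idx_x = blockIdx.x * blockDim.x + threadIdx.x;  // column (Hint #1)
    int idx_y = blockIdx.y * blockDim.y + threadIdx.y;  // row (Hint #1)

    if (idx_x < width && idx_y < height)                // boundary check (Hint #2)
    {
        int offset_out = idx_y * width + idx_x;         // one value per output pixel
        int offset     = offset_out * 3;                // three values (BGR) per input pixel

        image_out[offset_out] = 0.0722f * image[offset]      // B
                              + 0.7152f * image[offset + 1]  // G
                              + 0.2126f * image[offset + 2]; // R
    }
}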

Convolution Filtering

The second step is to implement a Gaussian filter to smooth the grayscale image generated by the
previous kernel. We implement this filter to reduce the noise and increase the quality of the input for
the next step, the Sobel filter, which is very sensitive to noise. For this exercise, we are going to apply
a Gaussian filter using a 3×3 convolution matrix on all the pixels of the image. The term
convolution refers to adding each pixel to its local neighbors, weighted by the matrix values:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/conv.png)


The * operator represents convolution, not matrix multiplication. Here, you have to treat each
pixel as the center of the 3×3 convolution matrix and apply the weights to the surrounding pixels.
As we use symmetric filters, the order can be top-to-bottom as well.

Once the kernel is implemented and executed, a file images/hw3_result_2.bmp will be generated.
The differences are very fine; to see them, do:

$ montage -tile 2x1 -crop 320x180+512+512 -geometry 640x360 \

images/hw3_result_1.bmp images/hw3_result_2.bmp \

images/hw3_result_2_comp.jpg

$ display images/hw3_result_2_comp.jpg

The new window will display a cropped area of the original black & white image (left), and a cropped
area of the new blurred image (right). The differences are very subtle, but you should be able to
notice some differences:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig3.jpeg)

TODO: Find the implementation of cpu_applyFilter() inside the hw3_ex1.cu file and try to
understand how a given convolution matrix is applied to a certain pixel.

Hint #1: The input block of the image is given by the top-left corner, not the center of the block
(the target pixel).
Hint #2: This is not a matrix-matrix multiplication; keep this in mind while reviewing the source
code.
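
To see the pattern in compact form, a helper that applies a filter_dim x filter_dim matrix to the
block whose top-left corner is at (sx, sy) could look roughly like the following sketch (the names
are assumptions, not necessarily the skeleton's exact code); the classic 3×3 Gaussian matrix used
for this kind of smoothing is 1/16 * {1 2 1; 2 4 2; 1 2 1}:

__device__ float applyFilter(float *image, int stride, float *matrix,
                             int filter_dim, int sx, int sy)
{
    float pixel = 0.0f;

    for (int h = 0; h < filter_dim; h++)
    {
        int offset   = (sy + h) * stride + sx;  // start of this row in the image block
        int offset_f = h * filter_dim;          // start of this row in the filter

        for (int w = 0; w < filter_dim; w++)
        {
            // weighted sum of the neighborhood, not a matrix-matrix product
            pixel += matrix[offset_f + w] * image[offset + w];
        }
    }

    return pixel;
}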

Detecting Edges in the Image

Finally, we complete the edge detector by applying the Sobel filter. With this filter, we are going to
compute an approximation of the gradient of the image intensity function. This allows us to create a
new image where the edges are emphasized, which constitutes the basis for full edge detection
algorithms such as Canny (https://en.wikipedia.org/wiki/Canny_edge_detector).

The filter uses two 3×3 kernels which are convolved with the original image to calculate
approximations of the derivatives in the horizontal and vertical directions. In other words, if we define
A as the source image, and Gx and Gy as two convolution matrices that generate the horizontal and
vertical derivative approximations, the computations are as follows:


(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/matrix.png)

The resultant gradient magnitude of the pixel is obtained by calculating the square root of the sum
of their squares, G = √(Gx² + Gy²):

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/gradient.png)

For the last exercise, we want you to implement the GPU version of cpu_sobel(), which is already
declared in hw3_ex1.cu under the name gpu_sobel(). The implementation of this function is very
similar to gpu_gaussian(), except that we apply two different convolution filters to the
same pixel and combine the results.
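
Schematically, and reusing the hypothetical applyFilter() helper sketched earlier, the per-pixel
work inside gpu_sobel() could look like this (sobel_x and sobel_y are assumed to hold the two
3×3 kernels):

// inside gpu_sobel(), after the usual index calculation and boundary check:
float gx = applyFilter(image, width, sobel_x, 3, idx_x, idx_y);  // horizontal derivative
float gy = applyFilter(image, width, sobel_y, 3, idx_x, idx_y);  // vertical derivative
image_out[idx_y * width + idx_x] = sqrtf(gx * gx + gy * gy);     // gradient magnitude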

Once the implementation is complete, run the program and open the result with the following two
commands:

$ montage -border 0 -geometry 640x360 -tile 3x1 \

images/hw3.bmp images/hw3_result_1.bmp \

images/hw3_result_3.bmp images/hw3_result_3_comp.jpg

$ display images/hw3_result_3_comp.jpg

A new window will open that displays the original image (left), the black & white image (center), and
finally the result of applying the Gaussian and Sobel filters (right):

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig4.jpeg)

You can also observe what the resulting image looks like at a larger resolution. Use the display
command in combination with the resize flag:

$ display -resize 1280x720 images/hw3_result_3.bmp

Alternatively, you can remove the " -resize 1280x720 " option to visualize the image at full
resolution. This might take some time to load, but it may be worth it to appreciate all the small
details. Whether you resize the image or not, you should observe something like the following:


(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig5.jpeg)

Optimizing Memory Accesses

In this section, we are going to optimize the GPU versions of the Gaussian and Sobel filters by
using Shared Memory. The idea is to bring the content of the image from Global Memory
to Shared Memory in blocks of size BLOCK_SIZE_SH. This constant is the dimension of each
block inside the grid, plus some additional values in X and Y.

We first ask you to declare the BLOCK_SIZE_SH constant at the top of the file, which defines the
dimension of the Shared Memory block. Use the following:

#define BLOCK_SIZE_SH 18

We will explain later why we use 18 here and not 16, the number of threads per block in each
dimension.

We will use this constant for the declaration of the memory space inside gpu_gaussian() and
gpu_sobel(). Place the declaration in one of the first lines of each kernel:

__shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

This declares a 2D shared block in Shared Memory, using the 1D array representation that we
have already discussed in the previous exercises. The __shared__ attribute in the declaration
tells the compiler that we want this variable to be located in Shared Memory and
not in Local or Global Memory.

Hence, the first exercise would be to declare the shared block inside gpu_gaussian() and
gpu_sobel(). Then, we ask you to make each thread copy a pixel from the input image into the
shared memory block. You have to call __syncthreads() to guarantee that each thread has finished
retrieving its part of the block before using the data. Thereafter, change the input of the
applyFilter() function to use the shared block instead.

TODO: In hw3_ex1.cu , declare a Shared Memory block within gpu_gaussian() and another one
within gpu_sobel() . Thereafter, introduce the necessary changes to make each thread bring one
pixel value to the shared block. Change the input parameter of applyFilter() to use the shared block
(i.e., instead of a reference to the input image directly).

Hint #1: Use __syncthreads() to guarantee that all the threads have copied their pixels to the
Shared Memory.

If you have implemented it "correctly", you will observe that the output result is not exactly what you
expected it to be. By now you should see something like this in the case of the Gaussian filter, in a
side-by-side comparison with the original image:

(https://github.com/PDC-support/cuda-lab-exercises/blob/master/lab_2/html/fig6.jpeg)

The reason is that the exercise is a little more complex than one might initially expect. With the
change that you just introduced, we are not considering that we also have to bring in extra columns
and rows on the sides of the block. Without this change, some of the threads access
uninitialized data.

This is the main reason why we declared the constant BLOCK_SIZE_SH with two additional elements
per dimension. This will make sure that all the threads within the block access data that is available
inside the Shared Memory space. As such, the final exercise for you would be to consider the
boundaries of each thread block. We already gave you a hint in the declaration of the constant
BLOCK_SIZE_SH (i.e., two extra columns and rows are needed).

TODO: Extend the Shared Memory version of gpu_gaussian() and gpu_sobel() to transfer part
of the surrounding pixels of the thread block to Shared Memory. Make sure that you do not exceed
the boundaries of the image.

Hint #1: Once again, use __syncthreads() to guarantee that all the threads have copied their
pixels to the Shared Memory.
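
Putting the pieces together, the staging phase could look roughly like the following sketch. It
assumes 16×16 thread blocks and a one-pixel halo on every side of the shared block; the skeleton
may organize the halo differently, so treat this as an illustration rather than the expected solution:

__shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
int tx = threadIdx.x + 1;  // +1: the interior starts after the halo column
int ty = threadIdx.y + 1;  // +1: the interior starts after the halo row

if (idx_x < width && idx_y < height)
{
    // each thread stages its own pixel ...
    sh_block[ty * BLOCK_SIZE_SH + tx] = image[idx_y * width + idx_x];

    // ... and edge threads additionally stage the halo (left edge shown here;
    // the right, top, bottom edges and the corners are analogous, always
    // checking the image boundaries):
    if (threadIdx.x == 0 && idx_x > 0)
        sh_block[ty * BLOCK_SIZE_SH] = image[idx_y * width + idx_x - 1];
}
__syncthreads();  // every pixel must be staged before any thread filters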

Questions to answer in the report


1. Explain how the mapping of GPU threads and thread blocks (which is already implemented for you
in the code) works.
2. Explain why shared memory can (theoretically) improve performance.
3. Explain why the resulting image looks like a "grid" when the kernel simply copies pixels to
the shared block. Explain how this is solved and what the cases are.
4. There are several images of different sizes in the image folder. Try running the program on them
and report how the execution time relates to the file sizes.

Exercise 2 - Pinned and Managed Memory


In this exercise, we study the use of pinned memory and managed memory. We will reuse your
solution from Assignment II - Exercise 3 (the particle mover) for this exercise.

Programming exercise on Pinned Memory

Put the file of this sub-exercise in:

DD2360-HT19/Assignment_3/ex_2

and call it exercise_2a.cu.

In our particle simulator, once the initial particle data is copied to the GPU after initialization, the
simulation is completely offloaded and time-stepped on the GPU. In reality, an application often
includes steps that require processing on the host. This dependency implies that data has to be
transferred back and forth every time step. Implement the following:

1. Modify the program, such that


1. All particles are copied to the GPU at the beginning of a time step.


2. All the particles are copied back to the host after the kernel completes, before proceeding to
the next time step.
2. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory.
3. Change the appropriate memory allocator to use cudaMallocHost() (a sketch follows this list).
4. Use nvprof to study the time spent on data movement and actual computation, with a large
number of particles that can fill the GPU memory. Also, note the time spent on allocation.
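
A minimal sketch of the per-time-step structure with pinned host memory; particle_t,
timestep_kernel and NUM_PARTICLES are assumptions carried over from the Assignment II solution:

particle_t *particles;
cudaMallocHost((void **)&particles, NUM_PARTICLES * sizeof(particle_t));  // pinned, instead of malloc()

for (int t = 0; t < NUM_ITERATIONS; t++)
{
    cudaMemcpy(d_particles, particles, NUM_PARTICLES * sizeof(particle_t),
               cudaMemcpyHostToDevice);                        // copy in at the start of the step
    timestep_kernel<<<grid, block>>>(d_particles, NUM_PARTICLES, t);
    cudaMemcpy(particles, d_particles, NUM_PARTICLES * sizeof(particle_t),
               cudaMemcpyDeviceToHost);                        // copy back before the next step
}

cudaFreeHost(particles);  // pinned allocations are released with cudaFreeHost()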

Programming exercise on Managed Memory

Put the file of this sub-exercise in:

Assignment_3/ex_2

and call it exercise_2b.cu.

Make the necessary changes to the program so that it uses managed memory (a sketch follows the list).

1. Change the GPU memory allocators to use cudaMallocManaged() .


2. Eliminate explicit data copy and device pointers.
3. Study the breakdown of timing using nvprof.
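
A minimal sketch of the managed-memory version, under the same naming assumptions as above;
note that a single pointer is now used on both the host and the device:

particle_t *particles;
cudaMallocManaged(&particles, NUM_PARTICLES * sizeof(particle_t));  // one pointer for host and device

init_particles(particles, NUM_PARTICLES);  // the host writes directly, no cudaMemcpy

for (int t = 0; t < NUM_ITERATIONS; t++)
{
    timestep_kernel<<<grid, block>>>(particles, NUM_PARTICLES, t);  // no explicit copies
    cudaDeviceSynchronize();  // make the results visible before the host touches them
}

cudaFree(particles);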

Questions to answer in the report

1. What are the differences between pageable memory and pinned memory? What are the tradeoffs?
2. Do you see any difference in the breakdown of execution time after changing from pageable
memory to pinned memory?
3. What is managed memory? What are the implications of using managed memory?
4. If you are using Tegner or lab computers, the use of managed memory will result in an implicit
memory copy before CUDA kernel launch. Why is that?

Exercise 3 - CUDA Streams / Asynchronous Copy - Particle Batching

In this exercise, we study the use of CUDA streams and asynchronous memory copies to overlap
communication and computation. We apply CUDA streams and asynchronous copies to the particle
mover: we will divide the particles into batches, copy a batch to GPU memory, move the particles in
the batch, and copy the updated particle values back to CPU memory. We will reuse the
pinned-memory solution of Exercise 2, as pinned memory is required to perform
asynchronous copies.

Put the file of this sub-exercise in:

Assignment_3/ex_3

Implement the following:


1. Change the code you implemented for Exercise 2 to divide the particles into batches of a
given size.
2. Create CUDA streams (1 stream first, then 2 and 4 streams) to asynchronously copy particle
batches to the GPU, update the particle values on the GPU, and copy these values back to CPU
memory (a sketch follows this list).
3. Compare the performance with and without particle batches (the performance of Exercise 2 with
pinned memory versus the performance of this exercise) while varying the batch size. If possible,
use nvprof to collect traces and the NVIDIA Visual Profiler (nvvp) to visualize the overlap of
communication and computation. To use nvvp, you can check Tutorial: NVVP - Visualize
nvprof Traces
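
A minimal sketch of the batching loop with asynchronous copies; NUM_STREAMS, BATCH_SIZE and the
other names are assumptions, and the particle array must live in pinned memory for the copies to
be truly asynchronous:

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++)
    cudaStreamCreate(&streams[i]);

for (int b = 0; b < NUM_PARTICLES / BATCH_SIZE; b++)
{
    int offset = b * BATCH_SIZE;
    cudaStream_t s = streams[b % NUM_STREAMS];  // round-robin over the streams

    cudaMemcpyAsync(&d_particles[offset], &particles[offset],
                    BATCH_SIZE * sizeof(particle_t), cudaMemcpyHostToDevice, s);
    timestep_kernel<<<grid, block, 0, s>>>(&d_particles[offset], BATCH_SIZE, t);
    cudaMemcpyAsync(&particles[offset], &d_particles[offset],
                    BATCH_SIZE * sizeof(particle_t), cudaMemcpyDeviceToHost, s);
}

for (int i = 0; i < NUM_STREAMS; i++)
    cudaStreamSynchronize(streams[i]);  // all batches complete before the next time step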

Questions to answer in the report

1. What are the advantages of using CUDA streams and asynchronous memory copies?
2. What is the performance improvement (if any) in using more than one CUDA stream?
3. What is the impact of batch size on the performance?

Bonus Exercise - CUDA Libraries - cuBLAS


In this exercise, we study one of the most widely used CUDA libraries, cuBLAS. The library is used
to accelerate BLAS (Basic Linear Algebra Subprograms) operations. The library has a host interface
and is highly optimized. We use SGEMM (Single-precision General Matrix Multiply) of square
matrices as a case study in this exercise.

Put the file of this sub-exercise in:

Assignment_3/ex_bonus

We provide you with skeleton code in:

DD2360-HT19/Assignment_3/ex_3/exercise_3.cu

For simplicity, we only consider matrices with widths that are a multiple of 16. The number can be
changed by defining TILE_SIZE. We have also implemented the timing for you. Select the appropriate
architecture and compile the code as follows:

$ nvcc -O3 -arch=sm_30 exercise_3.cu -o exercise_3.out -lcurand -lcublas

While developing the sub-exercises, you can comment out some parts of the code in main() for
testing. The code can be executed with -s [matrix size] and, optionally, -v to perform CPU
verification. For example, to test a 1024x1024 matrix with CPU verification:

$ srun -n 1 ./exercise_3.out -s 1024 -v

Matrix size: 1024x1024

Matrix size: 1024x1024

Grid size: 64x64

Tile size: 16x16

Run CPU sgemm: 1

CPU matmul: 3358.208000 ms

GPU cuBLAS matmul: 1.233000 ms

GPU matmul (global memory): 78.396000 ms

GPU matmul (shared memory): 7.610000 ms

GEMM

GEMM is defined as C = αAB + βC, where A and B are the matrices to be multiplied and C is the
matrix where the result is accumulated. α is a scalar that controls the scaling of the multiplication
result and β is another scalar that controls how the result is accumulated into C. In this exercise,
we only consider matrix multiplication. Therefore we set α = 1 and β = 0.

Naive matrix multiplication

Study the function cpu_matmul() and the GPU kernel naive_sgemm_kernel(). Understand how the
CPU version is translated to the GPU version. Also, understand how the threads and thread blocks
are organized.

Just for fun: for those who took the course DD2356, do you remember why the two inner loops in cpu_matmul() are reordered?

Matrix multiplication with shared memory

Study the code in the GPU kernel shared_sgemm_kernel(). The code performs tiled matrix
multiplication. The algorithm of tiled matrix multiplication is the same as the ordinary matrix
multiplication algorithm, except that the lowest unit of multiplication is sub-matrices instead of
scalars. More information can be found here
(https://www.cs.utexas.edu/users/rvdg/LinAlgBook/Section2-5.pdf).

Fill in the parts of the function that are marked with a TODO comment, as sketched below. Ensure
the CPU verification passes.
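
For orientation, the core of such a kernel typically looks like the following sketch, where n is the
matrix width and the variable names are illustrative rather than the skeleton's:

__shared__ float sA[TILE_SIZE][TILE_SIZE];
__shared__ float sB[TILE_SIZE][TILE_SIZE];

int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float acc = 0.0f;

for (int t = 0; t < n / TILE_SIZE; t++)
{
    // each thread stages one element of the current A tile and B tile
    sA[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE_SIZE + threadIdx.x)];
    sB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * n + col];
    __syncthreads();  // first barrier: the tiles must be fully staged before use

    for (int k = 0; k < TILE_SIZE; k++)
        acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];  // multiply the resident tiles
    __syncthreads();  // second barrier: the tiles must be consumed before overwriting
}

C[row * n + col] = acc;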

Matrix multiplication with cuBLAS

Study the code in the function cublas_sgemm(). The function call to cuBLAS SGEMM is already
coded for you; fill in the blanks, as sketched below. Hint: since all the matrices are square, all the
widths and strides are the same. Ensure the CPU verification passes.
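
For square n×n matrices with α = 1 and β = 0, the call typically takes the following shape (a
sketch, with d_A, d_B and d_C as device pointers; the swapped operand order relates to question 3
below):

float alpha = 1.0f, beta = 0.0f;
cublasHandle_t handle;
cublasCreate(&handle);

// cuBLAS assumes column-major storage; passing the row-major B and A in
// swapped order yields a row-major C = A * B without any transposition
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, n, n,
            &alpha, d_B, n, d_A, n,
            &beta,  d_C, n);

cublasDestroy(handle);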

Questions to answer in the report

1. Explain why the matrix size has to be a multiple of 16.


2. Refer to shared_sgemm_kernel(). There are two __syncthreads() in the loop. What are they used
for, in the context of this code?
1. What is the directive that can potentially improve performance in the actual multiplication?
What does it do?
2. There is a large speedup after switching from using global memory to shared memory,
compared to the Edge Detector in Exercise 1. What might be the reason?
3. Refer to cublas_sgemm(). We asked that you compute B × A instead of A × B. It has to
do with an important property of cuBLAS. What is that property, and why do we compute B × A?
4. Run the program with different input sizes, for example from 64, 128, ..., to 4096. Make a
grouped bar plot of the execution times of the different versions (CPU, GPU Global, GPU Shared,
GPU cuBLAS). You can plot the CPU results in a separate figure if their execution time goes out of
scale compared to the rest.

5. The way the execution-time benchmark is implemented in the code is good enough for this
exercise, but in general it is not a good way to benchmark. Why?
