HPC Practical 4: Addition of Two Large Vectors
Title: Design and implementation of parallel (CUDA) algorithms to add two large vectors, multiply a vector by a matrix, and multiply two N × N arrays using n² threads.
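A minimal sketch of the vector-addition part of the assignment, assuming single-precision arrays and one thread per element; the kernel name vecAddKernel and the launch parameters shown are illustrative, not taken from the original.

// One thread per element of the result vector.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard threads past the end of the vectors
        c[i] = a[i] + b[i];
}

// Typical host-side launch for n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAddKernel<<<blocks, threads>>>(d_a, d_b, d_c, n);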
Outcome: To offload parallel computations to the graphics card when it is appropriate to do so, and to give some idea of how to think about code running in the massively parallel environment presented by today's graphics cards.
Outcome: Students should understand the basics of GPU computing in the CUDA environment.
Hardware Specification: x86_64 system, 2–4 GB DDR RAM, 80–500 GB SATA HDD, NVIDIA TITAN X Graphics Card.
Software Specification: Ubuntu 14.04, GPU Driver 352.68, CUDA Toolkit 8.0, cuDNN Library v5.0
Introduction:
It has become increasingly common to see supercomputing applications harness the massive parallelism of graphics cards (Graphics Processing Units, or GPUs) to speed up computations. One platform for doing so is NVIDIA's Compute Unified Device Architecture (CUDA). We use the example of matrix multiplication to introduce the basics of GPU computing in the CUDA environment.
Matrix multiplication is a fundamental building block for scientific computing. Moreover, its algorithmic patterns are representative: many other algorithms share similar optimization techniques. Therefore, matrix multiplication is one of the most important examples in learning parallel programming.
We write a kernel that allows host code to offload matrix multiplication to the GPU. The kernel function is shown below.
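A sketch of this kernel, consistent with the line-by-line description that follows, is given here; the Matrix structure (with width, height, and elements fields) and the name MatMulKernel are illustrative assumptions rather than quotations from the original listing.

// Row-major matrix representation assumed for this sketch.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Each thread computes one element of the product matrix C.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= C.height || col >= C.width) return;   // thread falls outside C
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}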
The first line contains the __global__ keyword, declaring that this is an entry-point function for running code on the device. The declaration float Cvalue = 0 sets aside a register to hold the float value in which we will accumulate the product of the row and column entries. The next two lines help the thread discover its row and column within the matrix; it is a good idea to make sure you understand those two lines before moving on. The if statement in the next line terminates the thread if its row or column places it outside the bounds of the product matrix. This will happen only in those blocks that overhang either the right or bottom side of the matrix.
The next three lines loop over the entries of the row of A and the column of B (these have the same length) needed to compute the (row, col) entry of the product, and the sum of these products is accumulated in the Cvalue variable. Matrices A and B are stored in the device's global memory in row-major order, meaning that each matrix is stored as a one-dimensional array, with the first row followed by the second row, and so on. Thus the index in this linear array of the (i, j) entry of matrix A is i * A.width + j. Finally, the last line of the kernel copies this product into the appropriate element of the product matrix C, in the device's global memory.
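A minimal sketch of the host-side code that stages the matrices in device global memory and launches the kernel, assuming the Matrix structure above; the function name MatMul and the 16 × 16 block size are illustrative choices.

#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Copy A and B to device global memory, launch one thread per element
// of C, then copy the result back to the host.
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    Matrix d_A = A, d_B = B, d_C = C;
    size_t sizeA = A.width * A.height * sizeof(float);
    size_t sizeB = B.width * B.height * sizeof(float);
    size_t sizeC = C.width * C.height * sizeof(float);

    cudaMalloc(&d_A.elements, sizeA);
    cudaMalloc(&d_B.elements, sizeB);
    cudaMalloc(&d_C.elements, sizeC);
    cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((C.width  + dimBlock.x - 1) / dimBlock.x,
                 (C.height + dimBlock.y - 1) / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}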
In light of the memory hierarchy described above, each thread loads (2 × A.width) elements from global memory in this kernel: two for each iteration through the loop, one from matrix A and one from matrix B. Since accesses to global memory are relatively slow, this can bog down the kernel, leaving the threads idle for hundreds of clock cycles for each access.
Matrix A is shown on the left and matrix B is shown at the top, with matrix C, their product, on the bottom-right. This is a nice way to lay out the matrices visually, since each element of C is the dot product of the row to its left in A and the column above it in B. In the figure we use square thread blocks of dimension BLOCK_SIZE × BLOCK_SIZE and will assume that the dimensions of A and B are all multiples of BLOCK_SIZE. Again, each thread will be responsible for computing one element of the product matrix C.
The tiled approach decomposes matrices A and B into non-overlapping submatrices of size BLOCK_SIZE × BLOCK_SIZE, shown in the figure above as the red row and red column. A thread passes through the same number of these submatrices of A and of B, since the row and column it needs are of equal length. If we load the left-most of those submatrices of matrix A into shared memory, and the top-most of those submatrices of matrix B into shared memory, then we can compute the first BLOCK_SIZE products and add them together just by reading shared memory.
Here is the benefit: as long as those submatrices are in shared memory, every thread in the thread block (which computes a BLOCK_SIZE × BLOCK_SIZE submatrix of C) can compute its portion of the sum from the same data in shared memory. When each thread has computed this partial sum, it loads the next BLOCK_SIZE × BLOCK_SIZE submatrices from A and B and continues adding the term-by-term products to its value of C.
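A sketch of a shared-memory (tiled) kernel along these lines, assuming the same Matrix structure as above and that the matrix dimensions are multiples of BLOCK_SIZE; the name MatMulSharedKernel is illustrative.

// Tiled kernel sketch: each thread block stages BLOCK_SIZE x BLOCK_SIZE
// submatrices of A and B in shared memory and accumulates partial sums.
__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Cvalue = 0;

    // Walk across the row of submatrices of A and down the column of
    // submatrices of B (both counts equal A.width / BLOCK_SIZE).
    for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
        // Each thread loads one element of the current A tile and one of B.
        As[threadIdx.y][threadIdx.x] = A.elements[row * A.width + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B.elements[(m * BLOCK_SIZE + threadIdx.y) * B.width + col];
        __syncthreads();   // wait until the whole tile is in shared memory

        // Accumulate BLOCK_SIZE products by reading shared memory only.
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
        __syncthreads();   // wait before the tiles are overwritten
    }
    C.elements[row * C.width + col] = Cvalue;
}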
Conclusion: Parallel addition of two large vectors and matrix multiplication have been implemented using GPU computing in the CUDA environment.