HPC Practical 4: Addition of Two Large Vectors
Title: Design and implementation of parallel (CUDA) algorithms to add two large vectors, multiply a vector by a matrix, and multiply two N × N arrays using n² threads.
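A minimal sketch of the vector-addition part of the assignment, assuming single-precision arrays and one thread per element; the kernel name vecAddKernel and the launch parameters shown are illustrative, not taken from the original.

// One thread per element of the result vector.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard threads past the end of the vectors
        c[i] = a[i] + b[i];
}

// Typical host-side launch for n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAddKernel<<<blocks, threads>>>(d_a, d_b, d_c, n);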
Outcome: To offload parallel computations to the graphics card when it is appropriate to do so, and to give some idea of how to think about code running in the massively parallel environment presented by today's graphics cards.
Outcome: Students should understand the basics of GPU computing in the CUDA environment.
Hardware Specification: x86_64 system, 2–4 GB DDR RAM, 80–500 GB SATA HDD, NVIDIA TITAN X Graphics Card.
Software Specification: Ubuntu 14.04, GPU Driver 352.68, CUDA Toolkit 8.0, cuDNN Library v5.0
Introduction:
It has become increasingly common to see supercomputing applications harness the massive parallelism of graphics cards (Graphics Processing Units, or GPUs) to speed up computations. One platform for doing so is NVIDIA's Compute Unified Device Architecture (CUDA). We use the example of matrix multiplication to introduce the basics of GPU computing in the CUDA environment.
Matrix multiplication is a fundamental building block for scientific computing. Moreover, its algorithmic patterns are representative: many other algorithms share similar optimization techniques. Therefore, matrix multiplication is one of the most important examples in learning parallel programming.
We write a kernel that allows host code to offload matrix multiplication to the GPU. The kernel function is shown below.
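A sketch of this kernel, consistent with the line-by-line description that follows, is given here; the Matrix structure (with width, height, and elements fields) and the name MatMulKernel are illustrative assumptions rather than quotations from the original listing.

// Row-major matrix representation assumed for this sketch.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Each thread computes one element of the product matrix C.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= C.height || col >= C.width) return;   // thread falls outside C
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}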
The first line contains the __global__ keyword, declaring that this is an entry-point function for running code on the device. The declaration float Cvalue = 0 sets aside a register to hold the float value in which we will accumulate the product of the row and column entries. The next two lines help the thread discover its row and column within the matrix; it is a good idea to make sure you understand those two lines before moving on. The if statement in the next line terminates the thread if its row or column places it outside the bounds of the product matrix. This will happen only in those blocks that overhang either the right or bottom side of the matrix.
The next three lines loop over the entries of the row of A and the column of B (these have the same length) needed to compute the (row, col) entry of the product, and the sum of these products is accumulated in the Cvalue variable. Matrices A and B are stored in the device's global memory in row-major order, meaning that each matrix is stored as a one-dimensional array, with the first row followed by the second row, and so on. Thus the index in this linear array of the (i, j) entry of matrix A is i * A.width + j. Finally, the last line of the kernel copies this product into the appropriate element of the product matrix C, in the device's global memory.
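A minimal sketch of the host-side code that stages the matrices in device global memory and launches the kernel, assuming the Matrix structure above; the function name MatMul and the 16 × 16 block size are illustrative choices.

#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Copy A and B to device global memory, launch one thread per element
// of C, then copy the result back to the host.
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    Matrix d_A = A, d_B = B, d_C = C;
    size_t sizeA = A.width * A.height * sizeof(float);
    size_t sizeB = B.width * B.height * sizeof(float);
    size_t sizeC = C.width * C.height * sizeof(float);

    cudaMalloc(&d_A.elements, sizeA);
    cudaMalloc(&d_B.elements, sizeB);
    cudaMalloc(&d_C.elements, sizeC);
    cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid((C.width  + dimBlock.x - 1) / dimBlock.x,
                 (C.height + dimBlock.y - 1) / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}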
In light of the memory hierarchy described above, each thread loads (2 × A.width) elements from global memory in this kernel: two for each iteration through the loop, one from matrix A and one from matrix B. Since accesses to global memory are relatively slow, this can bog down the kernel, leaving the threads idle for hundreds of clock cycles for each access.
Matrix A is shown on the left and matrix B is shown at the top, with matrix C, their product, on the bottom-right. This is a nice way to lay out the matrices visually, since each element of C is the dot product of the row to its left in A and the column above it in B. In the figure we use square thread blocks of dimension BLOCK_SIZE × BLOCK_SIZE and will assume that the dimensions of A and B are all multiples of BLOCK_SIZE. Again, each thread will be responsible for computing one element of the product matrix C.
The tiled approach decomposes matrices A and B into non-overlapping submatrices of size BLOCK_SIZE × BLOCK_SIZE, shown in the figure above as the red row and red column. A thread passes through the same number of these submatrices of A and of B, since the row and column it needs are of equal length. If we load the left-most of those submatrices of matrix A into shared memory, and the top-most of those submatrices of matrix B into shared memory, then we can compute the first BLOCK_SIZE products and add them together just by reading shared memory.
Here is the benefit: as long as those submatrices are in shared memory, every thread in the thread block (which computes a BLOCK_SIZE × BLOCK_SIZE submatrix of C) can compute its portion of the sum from the same data in shared memory. When each thread has computed this partial sum, it loads the next BLOCK_SIZE × BLOCK_SIZE submatrices from A and B and continues adding the term-by-term products to its value of C.
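A sketch of a shared-memory (tiled) kernel along these lines, assuming the same Matrix structure as above and that the matrix dimensions are multiples of BLOCK_SIZE; the name MatMulSharedKernel is illustrative.

// Tiled kernel sketch: each thread block stages BLOCK_SIZE x BLOCK_SIZE
// submatrices of A and B in shared memory and accumulates partial sums.
__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Cvalue = 0;

    // Walk across the row of submatrices of A and down the column of
    // submatrices of B (both counts equal A.width / BLOCK_SIZE).
    for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
        // Each thread loads one element of the current A tile and one of B.
        As[threadIdx.y][threadIdx.x] = A.elements[row * A.width + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B.elements[(m * BLOCK_SIZE + threadIdx.y) * B.width + col];
        __syncthreads();   // wait until the whole tile is in shared memory

        // Accumulate BLOCK_SIZE products by reading shared memory only.
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
        __syncthreads();   // wait before the tiles are overwritten
    }
    C.elements[row * C.width + col] = Cvalue;
}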
Conclusion: Parallel addition of two large vectors and matrix multiplication have been implemented using GPU computing in the CUDA environment.