
AIR UNIVERSITY

DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING

EXPERIMENT NO. 4

Lab Title: Introduction to Parallel Programming with CUDA C: Exploring 2D Operations in CUDA C

Student Name: M. Bilal Ijaz, Agha Ammar Khan        Reg. No: 210316, 210300

Objective: Implement and analyze various 2D array/matrix operations in CUDA C

LAB ASSESSMENT:

Attributes                                                         Excellent (5)   Good (4)   Average (3)   Satisfactory (2)   Unsatisfactory (1)

Ability to Conduct Experiment

Ability to assimilate the results

Effective use of lab equipment and follows the lab safety rules

Total Marks: Obtained Marks:

LAB REPORT ASSESSMENT:

Attributes                     Excellent (5)   Good (4)   Average (3)   Satisfactory (2)   Unsatisfactory (1)

Data Presentation

Experiment Results

Conclusion

Total Marks: Obtained Marks:

Date: 17/10/2024 Signature:


LAB#04
TITLE: Exploring CUDA C Programming: 2D Operations.

Objective:
Implement and analyze various 2D array/matrix operations in CUDA C

Introduction:
The aim of this lab was to explore parallel computing techniques by implementing basic 2D
matrix operations using CUDA C. These operations include matrix addition, matrix
multiplication, matrix transposition, and scalar multiplication. CUDA (Compute Unified
Device Architecture) provides a platform for parallel computing on NVIDIA GPUs, allowing
developers to write code that exploits data-level parallelism for large datasets, such as 2D
matrices.

The primary objectives of this lab are to:

• Understand how to allocate and transfer memory between host (CPU) and device
(GPU).
• Use CUDA kernels to implement matrix operations using thread blocks.
• Optimize the design for thread allocation and workload distribution for 2D matrices.

This lab demonstrates the concepts of:


• Grid and block structure for managing threads.
• Memory handling between host and device.
• Synchronization of threads on the GPU to ensure correct computation.

Experiment Setup:
• Software: CUDA toolkit, NVIDIA CUDA Compiler (NVCC), C/C++ for code
implementation.
• Hardware: A machine with an NVIDIA GPU compatible with CUDA.
Each operation was implemented on a 16x16 matrix, using a block size of 16x16 threads.
This configuration allowed one thread to compute one element of the matrix.
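
For reference, a minimal sketch of this launch configuration (the macro and variable names are illustrative, not taken from the lab code):

#include <cuda_runtime.h>

#define N 16   // matrix dimension used in this lab: one thread per element

int main(void) {
    dim3 block(16, 16);                          // 16x16 = 256 threads per block
    dim3 grid((N + block.x - 1) / block.x,
              (N + block.y - 1) / block.y);      // a single 1x1 grid when N = 16
    // someKernel<<<grid, block>>>(...);         // each operation below is launched this way
    return 0;
}
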
Matrix Addition:
The task is to add two 2D matrices element-wise; each thread computes the sum for one element of
the resulting matrix. A minimal kernel-and-host-flow sketch follows the key steps below.

Key steps:
1. Matrices A and B are initialized on the host.
2. Memory is allocated on the device, and data is transferred from the host to the
device.
3. A CUDA kernel is launched, where each thread adds corresponding elements of
matrices A and B.
4. The result is copied back from the device to the host.
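
A minimal sketch of this flow, assuming N x N float matrices and the 16x16 block size used in this lab (all names are illustrative, not the actual lab code):

#include <stdio.h>
#include <cuda_runtime.h>

#define N 16

__global__ void matAdd(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;    // column handled by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;    // row handled by this thread
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];   // one element per thread
}

int main(void) {
    size_t bytes = N * N * sizeof(float);
    float h_A[N * N], h_B[N * N], h_C[N * N];
    for (int i = 0; i < N * N; i++) { h_A[i] = 1.0f; h_B[i] = 2.0f; }   // step 1: host init

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);                                   // step 2: device memory
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);                //         host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matAdd<<<grid, block>>>(d_A, d_B, d_C, N);                          // step 3: launch kernel

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);                // step 4: device -> host
    printf("C[0][0] = %.1f\n", h_C[0]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}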

Matrix Multiplication:
Matrix multiplication involves computing the dot product of each row of the first matrix with
each column of the second matrix; a kernel sketch follows the key steps below.

Key steps:
1. Each thread computes the value of one element in the resulting matrix.
2. For each thread, the dot product of one row of matrix A and one column of matrix B
is computed and assigned to the result matrix C.
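
A minimal sketch of a naive multiplication kernel matching this description (square n x n matrices; names are illustrative). The host-side allocation, copies, and launch follow the same pattern as the addition example above:

__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];   // dot product of row of A and column of B
        C[row * n + col] = sum;                       // one output element per thread
    }
}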

Matrix Transposition:
Matrix transposition involves switching the rows and columns of a matrix. In this case, each
thread transposes one element of the matrix; a kernel sketch follows the key steps below.

Key steps:
1. A CUDA kernel is launched where each thread switches the row and column indices
to transpose the matrix.
2. Every element A[i][j] is assigned to B[j][i].
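
A minimal kernel sketch for this operation (n x n input; names are illustrative), using the same host-side pattern as before:

__global__ void matTranspose(const float *A, float *B, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // j
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // i
    if (row < n && col < n)
        B[col * n + row] = A[row * n + col];           // B[j][i] = A[i][j]
}
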
Scalar Multiplication:
Scalar multiplication involves multiplying each element of a matrix by a constant scalar
value; a kernel sketch follows the key steps below.

Key steps:
1. Each thread multiplies one element of matrix A by a scalar k.
2. The result is stored in matrix C.
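
A minimal kernel sketch for this operation (n x n matrix, scalar k; names are illustrative):

__global__ void scalarMul(const float *A, float *C, float k, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = k * A[row * n + col];       // each thread scales one element
}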

Performance Considerations:
For all of the operations:
1. Thread Management: The grid and block dimensions were chosen to optimize the
number of threads per block, ensuring efficient parallelism.
2. Memory Transfer: Efficient transfer of data between host and device is crucial. Using
pinned (page-locked) host memory or memory pools may further optimize this; a brief sketch
follows this list.
3. Thread Synchronization: No explicit synchronization is required in these operations
since each thread works independently on separate elements of the matrix.
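
As a brief illustration of the pinned-memory point above (a sketch under stated assumptions, not part of the lab code):

#include <cuda_runtime.h>

int main(void) {
    float *h_A = NULL;
    size_t bytes = 16 * 16 * sizeof(float);
    cudaMallocHost((void **)&h_A, bytes);   // pinned (page-locked) host buffer: faster transfers,
                                            // and required for asynchronous cudaMemcpyAsync
    // ... fill h_A and copy it to the device with cudaMemcpy / cudaMemcpyAsync ...
    cudaFreeHost(h_A);                      // release the pinned buffer
    return 0;
}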

Lab Tasks:
Code and Output:
Task2:

Code and Output:


Task3:

Code and Output:


Task4:

Code and Output:


Conclusion:
This lab offered practical experience in performing basic 2D matrix operations using CUDA,
highlighting the power of data parallelism through the use of thread blocks. The programs
showcased significant performance improvements compared to serial CPU execution, as
tasks such as matrix addition, multiplication, transposition, and scalar multiplication were
efficiently distributed across multiple threads in a grid, enabling parallel processing.
The concepts learned in this lab lay a strong foundation for more advanced matrix
operations and optimization techniques, including the use of shared memory, tiling, and
stream-based computations, which will be explored in upcoming labs.
