
Group A

Assignment 4(B)

Title of the Assignment: Write a Program for Matrix Multiplication using CUDA C

Objective of the Assignment: Students should be able to write a program for Matrix Multiplication
using CUDA C

Prerequisite:
1. CUDA Concept
2. Matrix Multiplication
3. How to execute Program in CUDA Environment
---------------------------------------------------------------------------------------------------------------

Contents for Theory:


1. What is CUDA
2. Matrix Multiplication
3. Execution of a Program in the CUDA Environment
--------------------------------------------------------------------------------------------------------------
What is CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning,
and computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable developers
to write and execute parallel code on NVIDIA GPUs. It supports popular programming languages like C,
C++, and Python, and provides a simple programming model that abstracts away much of the low-level
detail of GPU architecture.

Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics,
chemistry, biology, and engineering.
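
As a minimal illustration of the CUDA programming model (a sketch, not part of the assignment program), the short example below launches a kernel that adds two small vectors on the GPU; the names vecAdd, h_a, and d_a are placeholders chosen here for illustration.

#include <cuda_runtime.h>
#include <iostream>

// Each GPU thread adds one pair of elements.
__global__ void vecAdd(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 8;
    int h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMalloc(&d_c, n * sizeof(int));
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);
    vecAdd<<<1, n>>>(d_a, d_b, d_c, n);   // one block of n threads
    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) std::cout << h_c[i] << " ";
    std::cout << std::endl;
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}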
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C; a minimal sketch mapping each step to the corresponding CUDA API call follows the list.

1. Matrix Initialization: The first step is to initialize the matrices that you want to multiply. You can use
standard C or CUDA functions to allocate memory for the matrices and initialize their values. The
matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for the matrices.
You can use the standard C malloc function to allocate memory on the host and the CUDA function
cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device. You can use the
CUDA function cudaMemcpy() to transfer data from the host to the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the matrix
multiplication on the device. You can use the <<<...>>> syntax to specify the number of blocks and
threads to use. Each thread in the kernel will compute one element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure that all kernel
executions have completed before proceeding. You can use the CUDA function
cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the device to the host.
You can use the CUDA function cudaMemcpy() to transfer data from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on the host and
the device. You can use the C free function to deallocate memory on the host and the CUDA function
cudaFree() to deallocate memory on the device.
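
The following minimal sketch shows how the seven steps above map onto CUDA API calls for a small matrix multiplication. It uses plain malloc/free on the host, as described in steps 2 and 7, and the names matmulSketch, h_A, and d_A are placeholders chosen here for illustration; the complete assignment program appears at the end of this document.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// One thread computes one element of C = A * B (N x N, row-major).
__global__ void matmulSketch(const int* A, const int* B, int* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    int N = 64;
    size_t size = N * N * sizeof(int);

    // Steps 1-2: initialize on the host (malloc) and allocate on the device (cudaMalloc).
    int* h_A = (int*)malloc(size);
    int* h_B = (int*)malloc(size);
    int* h_C = (int*)malloc(size);
    for (int i = 0; i < N * N; i++) { h_A[i] = 1; h_B[i] = 2; }
    int *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Step 3: transfer the input data from the host to the device.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 4: launch the kernel, one thread per output element.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matmulSketch<<<grid, block>>>(d_A, d_B, d_C, N);

    // Step 5: wait until all kernel executions have completed.
    cudaDeviceSynchronize();

    // Step 6: retrieve the result from the device to the host.
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %d (expected %d)\n", h_C[0], 2 * N);

    // Step 7: deallocate device memory (cudaFree) and host memory (free).
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}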

Execution of a Program in the CUDA Environment


1. Install CUDA Toolkit: First, you need to install the CUDA Toolkit on your system. You can
download the CUDA Toolkit from the NVIDIA website and follow the installation instructions
provided.
2. Set up CUDA environment: Once the CUDA Toolkit is installed, you need to set up the CUDA
environment on your system. This involves setting the PATH and LD_LIBRARY_PATH
environment variables to the appropriate directories.
3. Write the CUDA program: You need to write a CUDA program that performs the multiplication of
two matrices. You can use a text editor to write the program and save it with a .cu extension.
4. Compile the CUDA program: You need to compile the CUDA program using the nvcc compiler that
comes with the CUDA Toolkit. The command to compile the program is:

nvcc program_name.cu -o program_name

This will generate an executable program named program_name.

5. Run the CUDA program: Finally, you can run the CUDA program by executing the executable file
generated in the previous step. The command to run the program is:

./program_name
Questions:

1. What are the advantages of using CUDA to perform matrix multiplication compared to using a CPU?
2. How do you handle matrices that are too large to fit in GPU memory in CUDA matrix
multiplication?
3. How do you optimize the performance of the CUDA program for matrix multiplication?
4. How do you ensure correctness of the CUDA program for matrix multiplication and verify the
results?

-------------------------------------------Program--------------------------------------------------------
#include <cuda_runtime.h>
#include <iostream>

// Each thread computes one element of C = A * B (N x N, row-major).
__global__ void matmul(int* A, int* B, int* C, int N) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < N && Col < N) {
        int Pvalue = 0;
        for (int k = 0; k < N; k++) {
            Pvalue += A[Row * N + k] * B[k * N + Col];
        }
        C[Row * N + Col] = Pvalue;
    }
}

int main() {
    int N = 512;
    int size = N * N * sizeof(int);
    int *A, *B, *C;
    int *dev_A, *dev_B, *dev_C;

    // Allocate pinned host memory and device memory
    cudaMallocHost(&A, size);
    cudaMallocHost(&B, size);
    cudaMallocHost(&C, size);
    cudaMalloc(&dev_A, size);
    cudaMalloc(&dev_B, size);
    cudaMalloc(&dev_C, size);

    // Initialize matrices A and B
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i * N + j] = i * N + j;
            B[i * N + j] = j * N + i;
        }
    }

    // Copy the input matrices to the device
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, size, cudaMemcpyHostToDevice);

    // Launch one thread per output element (N is a multiple of 16)
    dim3 dimBlock(16, 16);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    matmul<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N);
    cudaDeviceSynchronize();

    // Copy the result back to the host
    cudaMemcpy(C, dev_C, size, cudaMemcpyDeviceToHost);

    // Print the top-left 10 x 10 corner of the result
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            std::cout << C[i * N + j] << " ";
        }
        std::cout << std::endl;
    }

    // Free memory
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    return 0;
}
