Assignment 4(B)
Title of the Assignment: Write a Program for Matrix Multiplication using CUDA C
Objective of the Assignment: Students should be able to write a program for Matrix Multiplication
using CUDA C.
Prerequisite:
1. CUDA Concept
2. Matrix Multiplication
3. How to execute a program in a CUDA environment
---------------------------------------------------------------------------------------------------------------
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning,
and computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable
developers to write and execute parallel code on NVIDIA GPUs. It directly supports C and C++, with
bindings available for languages such as Python, and provides a simple programming model that abstracts
away much of the low-level detail of GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics,
chemistry, biology, and engineering.
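To make the programming model concrete, here is a minimal, self-contained sketch (assuming a working
CUDA toolkit; the vecAdd kernel and all names in it are illustrative, not part of the assignment): the host
launches a grid of threads, and each thread uses its block and thread indices to pick the one element it
processes.
#include <cuda_runtime.h>
#include <iostream>
// Illustrative kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Managed (unified) memory keeps the sketch short; it is visible to host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    // Launch enough 256-thread blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait before reading the result on the host
    std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}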
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C:
1. Matrix Initialization: The first step is to initialize the matrices that you want to multiply. You can use
standard C or CUDA functions to allocate memory for the matrices and initialize their values. The
matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for the matrices.
You can use the standard C malloc function to allocate memory on the host and the CUDA function
cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device. You can use the
CUDA function cudaMemcpy() to transfer data from the host to the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the matrix
multiplication on the device. You can use the <<<...>>> syntax to specify the number of blocks and
threads to use. Each thread in the kernel will compute one element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure that all kernel
executions have completed before proceeding. You can use the CUDA function
cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the device to the host.
You can use the CUDA function cudaMemcpy() to transfer data from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on the host and
the device. You can use the C free function to deallocate memory on the host and the CUDA function
cudaFree() to deallocate memory on the device. A condensed sketch mapping all seven steps onto
CUDA runtime calls follows this list.
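As a compact, hedged illustration of the seven steps, the sketch below uses a hypothetical toy kernel
(square, which squares each array element in place; all names here are illustrative, not part of the
assignment program):
#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>
// Hypothetical toy kernel, used only to keep the sketch short.
__global__ void square(int* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}
int main() {
    const int n = 8;
    const size_t bytes = n * sizeof(int);
    int* h = (int*)malloc(bytes);                     // Step 2: host allocation
    for (int i = 0; i < n; i++) h[i] = i;             // Step 1: initialization
    int* d;
    cudaMalloc(&d, bytes);                            // Step 2: device allocation
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // Step 3: host -> device
    square<<<1, n>>>(d, n);                           // Step 4: kernel launch
    cudaDeviceSynchronize();                          // Step 5: synchronization
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // Step 6: device -> host
    for (int i = 0; i < n; i++) std::cout << h[i] << " ";  // 0 1 4 9 16 25 36 49
    std::cout << std::endl;
    cudaFree(d);                                      // Step 7: device memory
    free(h);                                          // Step 7: host memory
    return 0;
}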
Questions:
1. What are the advantages of using CUDA to perform matrix multiplication compared to using a CPU?
2. How do you handle matrices that are too large to fit in GPU memory in CUDA matrix
multiplication?
3. How do you optimize the performance of the CUDA program for matrix multiplication? (A
shared-memory tiling sketch follows this list.)
4. How do you ensure correctness of the CUDA program for matrix multiplication and verify the
results?
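For question 3, one standard optimization is tiling with shared memory, so each input element is loaded
from slow global memory once per tile rather than once per output element. Below is a hedged sketch;
matmulTiled and TILE are illustrative names, and the kernel is intended as a drop-in replacement for the
matmul kernel in the program below, launched with the same 16x16 block configuration:
#define TILE 16
__global__ void matmulTiled(const int* A, const int* B, int* C, int N) {
    // Per-block staging buffers in fast on-chip shared memory.
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int acc = 0;
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Cooperatively load one tile of A and one tile of B,
        // padding with zeros past the matrix edge.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    if (row < N && col < N) C[row * N + col] = acc;
}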
-------------------------------------------Program--------------------------------------------------------
#include <cuda_runtime.h>
#include <iostream>
__global__ void matmul(int* A, int* B, int* C, int N) {
    // Each thread computes one element of the output matrix C.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < N && Col < N) {
        int Pvalue = 0;
        for (int k = 0; k < N; k++) {
            Pvalue += A[Row*N+k] * B[k*N+Col];
        }
        C[Row*N+Col] = Pvalue;
    }
}
int main() {
    int N = 512;
    int size = N * N * sizeof(int);
    int* A, * B, * C;
    int* dev_A, * dev_B, * dev_C;
    // Pinned host memory speeds up host<->device transfers.
    cudaMallocHost(&A, size);
    cudaMallocHost(&B, size);
    cudaMallocHost(&C, size);
    cudaMalloc(&dev_A, size);
    cudaMalloc(&dev_B, size);
    cudaMalloc(&dev_C, size);
    // Initialize matrices A and B
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i*N+j] = i*N+j;
            B[i*N+j] = j*N+i;
        }
    }
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, size, cudaMemcpyHostToDevice);
    dim3 dimBlock(16, 16);
    // Round up so the grid covers all of N even when N is not a multiple of 16.
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    matmul<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N);
    // Wait for the kernel to finish before copying the result back.
    cudaDeviceSynchronize();
    cudaMemcpy(C, dev_C, size, cudaMemcpyDeviceToHost);
    // Print the top-left 10x10 corner of the result
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            std::cout << C[i*N+j] << " ";
        }
        std::cout << std::endl;
    }
    // Free memory
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    return 0;
}
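To compile and run the program (assuming the NVIDIA CUDA toolkit is installed and the source is saved
as matmul.cu, a hypothetical file name):
nvcc matmul.cu -o matmul
./matmul
The program prints the top-left 10x10 corner of the product matrix.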