Assignment 4(B)
Title of the Assignment: Write a Program for Matrix Multiplication using CUDA C
Objective of the Assignment: Students should be able to write a program for Matrix Multiplication
using CUDA C.
Prerequisite:
1. CUDA Concept
2. Matrix Multiplication
3. How to execute a program in a CUDA environment
---------------------------------------------------------------------------------------------------------------
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
developed by NVIDIA. It allows developers to use the power of NVIDIA graphics processing units (GPUs)
to accelerate computation tasks in various applications, including scientific computing, machine learning,
and computer vision. CUDA provides a set of programming APIs, libraries, and tools that enable
developers to write and execute parallel code on NVIDIA GPUs. It directly supports C and C++, with
bindings available for languages such as Python, and provides a simple programming model that abstracts
away much of the low-level detail of GPU architecture.
Using CUDA, developers can exploit the massive parallelism and high computational power of GPUs to
accelerate computationally intensive tasks, such as matrix operations, image processing, and deep learning.
CUDA has become an important tool for scientific research and is widely used in fields like physics,
chemistry, biology, and engineering.
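To make the programming model concrete, here is a minimal, self-contained sketch (assuming a working
CUDA toolkit; the vecAdd kernel and all names in it are illustrative, not part of the assignment): the host
launches a grid of threads, and each thread uses its block and thread indices to pick the one element it
processes.
#include <cuda_runtime.h>
#include <iostream>
// Illustrative kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Managed (unified) memory keeps the sketch short; it is visible to host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    // Launch enough 256-thread blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait before reading the result on the host
    std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}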
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C:
1. Matrix Initialization: The first step is to initialize the matrices that you want to multiply. You can use
standard C or CUDA functions to allocate memory for the matrices and initialize their values. The
matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for the matrices.
You can use the standard C malloc function to allocate memory on the host and the CUDA function
cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device. You can use the
CUDA function cudaMemcpy() to transfer data from the host to the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the matrix
multiplication on the device. You can use the <<<...>>> syntax to specify the number of blocks and
threads to use. Each thread in the kernel will compute one element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure that all kernel
executions have completed before proceeding. You can use the CUDA function
cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the device to the host.
You can use the CUDA function cudaMemcpy() to transfer data from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on the host and
the device. You can use the C free function to deallocate memory on the host and the CUDA function
cudaFree() to deallocate memory on the device. A condensed sketch mapping all seven steps onto
CUDA runtime calls follows this list.
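As a compact, hedged illustration of the seven steps, the sketch below uses a hypothetical toy kernel
(square, which squares each array element in place; all names here are illustrative, not part of the
assignment program):
#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>
// Hypothetical toy kernel, used only to keep the sketch short.
__global__ void square(int* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}
int main() {
    const int n = 8;
    const size_t bytes = n * sizeof(int);
    int* h = (int*)malloc(bytes);                     // Step 2: host allocation
    for (int i = 0; i < n; i++) h[i] = i;             // Step 1: initialization
    int* d;
    cudaMalloc(&d, bytes);                            // Step 2: device allocation
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // Step 3: host -> device
    square<<<1, n>>>(d, n);                           // Step 4: kernel launch
    cudaDeviceSynchronize();                          // Step 5: synchronization
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // Step 6: device -> host
    for (int i = 0; i < n; i++) std::cout << h[i] << " ";  // 0 1 4 9 16 25 36 49
    std::cout << std::endl;
    cudaFree(d);                                      // Step 7: device memory
    free(h);                                          // Step 7: host memory
    return 0;
}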
Questions:
1. What are the advantages of using CUDA to perform matrix multiplication compared to using a CPU?
2. How do you handle matrices that are too large to fit in GPU memory in CUDA matrix
multiplication?
3. How do you optimize the performance of the CUDA program for matrix multiplication? (A
shared-memory tiling sketch follows this list.)
4. How do you ensure correctness of the CUDA program for matrix multiplication and verify the
results?
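For question 3, one standard optimization is tiling with shared memory, so each input element is loaded
from slow global memory once per tile rather than once per output element. Below is a hedged sketch;
matmulTiled and TILE are illustrative names, and the kernel is intended as a drop-in replacement for the
matmul kernel in the program below, launched with the same 16x16 block configuration:
#define TILE 16
__global__ void matmulTiled(const int* A, const int* B, int* C, int N) {
    // Per-block staging buffers in fast on-chip shared memory.
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int acc = 0;
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Cooperatively load one tile of A and one tile of B,
        // padding with zeros past the matrix edge.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    if (row < N && col < N) C[row * N + col] = acc;
}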
-------------------------------------------Program--------------------------------------------------------
#include <cuda_runtime.h>
#include <iostream>
__global__ void matmul(int* A, int* B, int* C, int N) {
    // Each thread computes one element of the output matrix C.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < N && Col < N) {
        int Pvalue = 0;
        for (int k = 0; k < N; k++) {
            Pvalue += A[Row*N+k] * B[k*N+Col];
        }
        C[Row*N+Col] = Pvalue;
    }
}
int main() {
    int N = 512;
    int size = N * N * sizeof(int);
    int* A, * B, * C;
    int* dev_A, * dev_B, * dev_C;
    // Pinned host memory speeds up host<->device transfers.
    cudaMallocHost(&A, size);
    cudaMallocHost(&B, size);
    cudaMallocHost(&C, size);
    cudaMalloc(&dev_A, size);
    cudaMalloc(&dev_B, size);
    cudaMalloc(&dev_C, size);
    // Initialize matrices A and B
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i*N+j] = i*N+j;
            B[i*N+j] = j*N+i;
        }
    }
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, size, cudaMemcpyHostToDevice);
    dim3 dimBlock(16, 16);
    // Round up so the grid covers all of N even when N is not a multiple of 16.
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    matmul<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, N);
    // Wait for the kernel to finish before copying the result back.
    cudaDeviceSynchronize();
    cudaMemcpy(C, dev_C, size, cudaMemcpyDeviceToHost);
    // Print the top-left 10x10 corner of the result
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            std::cout << C[i*N+j] << " ";
        }
        std::cout << std::endl;
    }
    // Free memory
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    return 0;
}
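To compile and run the program (assuming the NVIDIA CUDA toolkit is installed and the source is saved
as matmul.cu, a hypothetical file name):
nvcc matmul.cu -o matmul
./matmul
The program prints the top-left 10x10 corner of the product matrix.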