
Faculty of Engineering & Technology

High Performance Computing Laboratory


(203105430)
B. Tech CSE 4th Year 7th Semester

PRACTICAL – 9

AIM: Write a simple CUDA program to print “Hello World!”

THEORY:

What is CUDA programming?

CUDA stands for Compute Unified Device Architecture. It is an extension of C/C++ programming and a parallel computing platform and API model developed by Nvidia that uses the Graphics Processing Unit (GPU). It allows computations to be performed in parallel, delivering substantial speed-ups. Using CUDA, one can harness the power of an Nvidia GPU for general-purpose computing tasks, such as processing matrices and other linear algebra operations, rather than only performing graphical calculations.

Logical architecture of GPU:

Left side: Thread organization: a single kernel is launched from the host (the CPU) and is executed by multiple threads on the device (the GPU). Threads are organized in three-dimensional structures named blocks, which can be, in turn, organized in three-dimensional grids. The dimensions of blocks and grids are explicitly defined by the programmer.

Right side: Memory hierarchy: threads can access data from many different memories with different scopes. Registers and local memory are private to each thread. Shared memory lets threads belonging to the same block communicate, and has low access latency.


All threads can access the global memory, which suffers from high latency but has been cached since the introduction of the Fermi architecture. Texture and constant memory can be read from any thread and are cached as well.
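As a small illustration of this hierarchy (not part of the original lab sheet; the kernel name and buffer size are arbitrary), the sketch below stages data in a __shared__ buffer so that threads of the same block can exchange values after a synchronization point. It assumes a launch with a single block of at most 256 threads.

__global__ void reverse_in_block(int *data)
{
    __shared__ int buf[256];            // visible to every thread in the block
    int i = threadIdx.x;
    buf[i] = data[i];                   // each thread stages one element
    __syncthreads();                    // wait until the whole block has written
    data[i] = buf[blockDim.x - 1 - i];  // read an element written by another thread
}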

Working of CUDA:

 GPUs run one kernel (a group of tasks) at a time.
 Each kernel consists of blocks, which are independent groups of threads.
 Each block contains threads, the smallest units of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, copying data between the CPU and the GPU is often the most expensive part of the computation.
 For each thread, registers and shared memory are the fastest; global, constant, and texture memory have higher latency, although they are cached.

Program Flow:

 Load data into CPU memory.
 Copy data from CPU to GPU memory – e.g., cudaMemcpy(…, cudaMemcpyHostToDevice).
 Call the GPU kernel using the device variables – e.g., kernel<<<blocks, threads>>>(gpuVar).
 Copy results from GPU to CPU memory – e.g., cudaMemcpy(…, cudaMemcpyDeviceToHost).
 Use the results on the CPU.

 CUDA program to print HELLO WORLD:
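The program itself appears only as a screenshot in the original; the following is a minimal sketch of such a kernel, written from a notebook cell in the same style as the later practicals (the file name hello.cu is illustrative):

%%writefile hello.cu
#include <stdio.h>

// Kernel: the message is printed from the device by every launched thread.
__global__ void hello()
{
    printf("Hello World!\n");
}

int main()
{
    hello<<<1, 1>>>();        // one block containing one thread
    cudaDeviceSynchronize();  // wait for the device so the output is flushed
    return 0;
}

!nvcc hello.cu -o hello
!./hello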


Output:


 With a different number of blocks and threads:
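The corresponding code is also only a screenshot; a sketch of the same kernel launched with, for example, 2 blocks of 4 threads (illustrative values), where each thread reports its block and thread index:

%%writefile hello_blocks.cu
#include <stdio.h>

__global__ void hello()
{
    // Each thread identifies itself by block and thread index.
    printf("Hello World! from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // 2 blocks x 4 threads = 8 prints
    cudaDeviceSynchronize();
    return 0;
}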

Output:


PRACTICAL – 10

AIM: Write a CUDA program to add two arrays.

1. CUDA Programming Basics

GPU Logical Layers:

 Grid: The grid is the outermost layer of organization in CUDA. It consists of blocks, and it represents the entire set of threads that will execute a kernel (a function that runs on the GPU). Its dimensions are specified by the programmer at kernel launch.

 Block: Within a grid, threads are grouped into blocks. Blocks are the next level of
organization and represent a group of threads that can cooperate with each other using shared
memory and synchronization primitives. Blocks are uniquely identified by a block index
within the grid.

 Warp: A warp is the smallest unit of threads that can be scheduled together on a GPU core. In CUDA, a warp typically consists of 32 threads (the exact size can vary with the GPU architecture and is exposed as warpSize). All threads within a warp execute in lockstep, meaning they execute the same instruction at the same time.

 Thread: The thread is the smallest unit of execution in CUDA. Each thread within a
block is assigned a unique thread index that can be used to calculate its position in data arrays
and to determine its behaviour within the kernel.


2. CUDA Program to add two numbers.
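The screenshot of this program is not reproduced here; it is essentially the same two-integer addition that is profiled in Practical 7 (AddIntsCUDA). A minimal sketch along those lines:

#include <stdio.h>
#include <cuda_runtime.h>

// Single-thread kernel: add the two device integers in place.
__global__ void AddInts(int *a, int *b)
{
    *a = *a + *b;
}

int main()
{
    int a = 5, b = 9;
    int *d_a, *d_b;

    // Allocate device memory and copy the inputs to the GPU.
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch one block with one thread, then copy the sum back.
    AddInts<<<1, 1>>>(d_a, d_b);
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("The answer is %d\n", a);

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}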


3. CUDA program to add two arrays.


%%writefile array.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(int *a, int *b, int *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    int n = 10;
    int *a, *b, *c;       // host arrays
    int *d_a, *d_b, *d_c; // device arrays
    int size = n * sizeof(int);

    // Allocate and initialise host memory.
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // Allocate device memory.
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy the input arrays from host to device.
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch one block of n threads.
    add<<<1, n>>>(d_a, d_b, d_c, n);

    // Copy the result back to the host and print it.
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    // Free device and host memory.
    cudaFree(d_a);
    cudaFree(d_b);


    cudaFree(d_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

!nvcc array.cu -o array


!./array

Output:


PRACTICAL – 07

AIM: Analyze the code using NVIDIA profilers.

What is a Profiler?
Profilers are tools used in software development to analyze and measure the performance of a
program. They help developers identify bottlenecks, memory leaks, and other inefficiencies
in their code. Profilers work by collecting data about various aspects of a program's execution,
such as CPU usage, memory usage, function call frequency, and execution time.

What is the NVIDIA Profiler?
nvprof is a command-line profiling tool that is part of the NVIDIA CUDA Toolkit. It is used to profile and analyze the performance of CUDA applications running on NVIDIA GPUs. The tool provides detailed information about the execution of the application, including the time spent in each kernel, the memory usage, and the data transfers between the host and the device. nvprof is a powerful tool that can help developers optimize their applications for better performance and efficiency.

Features of the NVIDIA Profiler:
 Measures kernel execution time.
 Reports memory transfer sizes and times between host and device.
 Provides API call statistics and timings.
 Generates detailed reports of GPU hardware events (e.g., warp execution efficiency).

Steps to perform profiling with the NVIDIA Profiler:

1. Run a CUDA-Enabled Environment: Launch your development environment, ensuring it supports CUDA, for example Google Colab, in which multiple languages can be run.


2. Select GPU: Ensure your system has a compatible NVIDIA GPU. Run the command nvidia-smi to verify GPU availability and details. In Colab, change the runtime type and select the T4 GPU, which supports the CUDA environment.

3. Check CUDA Version: Confirm the CUDA Toolkit version on your system using nvcc
--version or cat /usr/local/cuda/version.txt. This step ensures compatibility with CUDA-
dependent applications.

4. Pip Install CUDA-Supported Libraries: Use pip to install CUDA-supported Python libraries, such as TensorFlow or PyTorch. Example: pip install tensorflow-gpu. The nvcc4jupyter notebook plugin can be installed with the command ‘!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git‘


5. Load CUDA Plugins: If working within an environment like PyTorch, ensure CUDA
support is enabled. This usually involves importing the necessary CUDA modules and
checking for GPU availability using torch.cuda.is_available().

CODE:

%%writefile AddIntsCUDA1.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

__global__ void AddIntsCUDA(int *a, int *b) {
    *a = *a + *b;
}

int main() {
    int a = 5, b = 9;
    int *d_a, *d_b; // Device variable declaration

    // Allocation of Device Variables
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));

    // Copy Host Memory to Device Memory
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch Kernel
    AddIntsCUDA<<<1, 1>>>(d_a, d_b);

    // Copy Device Memory to Host Memory
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("The answer is %d\n", a);

    // Free Device Memory
    cudaFree(d_a);
    cudaFree(d_b);

    return 0;
}
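The lab sheet does not show the compile and profiling commands themselves; in a Colab-style notebook they would typically be (binary name assumed from the cell above):

!nvcc AddIntsCUDA1.cu -o AddIntsCUDA1
!nvprof ./AddIntsCUDA1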

Output:


PRACTICAL-06

AIM: Use Intel V-Tune Performance Analyzer for Profiling.

Theory:
What is Intel V-Tune?
Intel VTune Profiler (often referred to simply as Intel V-Tune) is a performance analysis and
profiling tool designed to help developers optimize their software applications for better
performance on Intel processors. It provides detailed insights into how software utilizes
hardware resources, enabling developers to identify and address performance bottlenecks.
Here are the key aspects of Intel VTune.

Flowchart:

1. Intel V-Tune Profiler:


 Overview:
 Intel VTune Performance Profiler is a performance analysis and
optimization tool that helps developers achieve high application
performance on Intel hardware.

 It provides detailed insights into system and application performance, helping developers identify and optimize performance bottlenecks.

 Key Features:
 Advanced CPU profiling: Analyze CPU usage, identify performance
bottlenecks, and optimize CPU-bound applications.


 GPU profiling: Profile GPU usage, identify performance bottlenecks, and optimize GPU-accelerated applications.
 Memory profiling: Analyze memory usage, identify memory leaks, and optimize memory-bound applications.
 Threading profiling: Analyze thread execution, identify synchronization issues, and optimize multi-threaded applications.

 Availability:
 Included in the Intel oneAPI Base Toolkit, which is a core set of tools
and libraries for developing high-performance, data-centric applications
across diverse architectures.
 Available as a stand-alone download.

 Benefits:
 Helps developers achieve high application performance on Intel hardware.
 Provides detailed insights into system and application performance.
 Helps optimize power consumption and identify performance
bottlenecks.
 Supports various programming languages, including C, C++, Fortran,
DPC++, OpenMP, and Python.

 Use Cases:
 Optimizing CPU-bound applications for high performance.
 Profiling and optimizing GPU-accelerated applications.
 Identifying and fixing memory leaks and optimizing memory-bound applications.
 Analyzing and optimizing system-wide performance.

2. Intel V-Tune Performance Analyzer:

 Overview:
 Intel VTune Performance Analyzer is a performance analysis and
optimization tool that helps developers identify and optimize
performance bottlenecks in their applications.
 It provides detailed insights into system and application performance,
helping developers achieve high performance and power efficiency.

 Key Features:
 Advanced performance analysis: Analyze CPU, GPU, and memory usage
to identify performance bottlenecks.


 Hotspot analysis: Identify the most time-consuming functions and loops in the application.
 Call graph analysis: Visualize the call graph to understand the application's execution flow.

 Benefits:
 Helps developers identify and optimize performance bottlenecks in their
applications.
 Helps achieve high application performance, power efficiency, and
scalability on Intel hardware.

 Use Cases:
 Identifying and optimizing performance bottlenecks in CPU-bound
applications.
 Analyzing and optimizing GPU-accelerated applications.
 Optimizing memory-bound applications.
 Developing high-performance, data-centric applications on Intel
hardware.

3. Conclusion:
Intel VTune Performance Analyzer is a powerful tool that helps developers optimize
their applications for high performance, power efficiency, and scalability on Intel
hardware. By providing detailed insights into system and application performance, it
helps developers identify and optimize performance bottlenecks, achieve high
application performance, and optimize power consumption.
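The lab sheet does not include a concrete invocation; as an illustrative example using the VTune command-line interface (the application name my_program and result directory r001 are assumed), a hotspots analysis can be collected and summarised with:

vtune -collect hotspots -result-dir r001 -- ./my_program
vtune -report summary -result-dir r001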


PRACTICAL-08

AIM: - Write a program to perform load distribution on GPU using CUDA.

THEORY:-
What is Load Balancing?
Load balancing refers to the process of distributing workloads evenly across multiple
computing resources, such as processors, nodes, or clusters, to ensure efficient utilization of
resources and to minimize processing time.

Why Load Balancing is important:


Load balancing is essential because it optimizes the use of computational resources, ensuring
that workloads are evenly distributed across available systems. This prevents some resources
from becoming overburdened while others remain idle, which in turn maximizes efficiency
and reduces processing time. By balancing the load, systems can handle more tasks
simultaneously, improving overall performance and throughput. Additionally, load balancing
supports scalability, allowing systems to maintain performance as the number of tasks or data
size increases. It also enhances reliability and fault tolerance by redistributing tasks from
failing or overloaded components, ensuring continuous operation and minimizing downtime.

What are the different load balancing techniques?

 There are several types of load balancing techniques (a minimal CUDA sketch of static cyclic distribution is given after this list). Some of them are:
 Static
 Dynamic
 Static technique is further divided into 3 types:
 Based on Data Partitioning:
o Array – Simple & Block Array
o Cyclic & Block
o Randomized
 Based on Task Partitioning
 Hierarchical
 Dynamic technique is further divided into 2 types:
 Centralized
o Master & Slave Technique
 Distributed
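As a minimal sketch of the static cyclic (round-robin) approach on a GPU (not taken from the lab sheet; the kernel name and workload are illustrative), a grid-stride loop lets a fixed number of threads share n elements evenly:

// Each thread handles elements i, i + stride, i + 2*stride, ..., so the work
// is distributed cyclically across all launched threads.
__global__ void scale_cyclic(float *data, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int stride = gridDim.x * blockDim.x;                 // total thread count
    for (; i < n; i += stride)
        data[i] = 2.0f * data[i];
}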


Advantages of Load Balancing:


 Optimized Resource Utilization: Load balancing ensures that all available resources
are used efficiently, preventing any single resource from becoming a bottleneck and
enhancing overall system performance.
 Reduced Latency: By distributing workloads evenly, load balancing minimizes
processing delays, leading to faster response times and quicker task completion.
 Adaptability: Dynamic load balancing techniques can adjust to changes in workload or
resource availability, making it easier to scale up or down as needed.
 Cost Efficiency: By maximizing resource utilization, load balancing helps in reducing
operational costs, as fewer resources are wasted and the need for additional hardware is
minimized.

CODE:

import time as tm
import matplotlib.pyplot as plt

nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
startingtime = []
endingtime = []
executiontime = []

# Record a start timestamp for each node.
for i in range(10):
    startingtime.append(tm.time())

# Simulate a task on each node and record its end timestamp.
for i in range(10):
    tm.sleep(0.1)
    endingtime.append(tm.time())

# Compute the elapsed time per node.
for i in range(10):
    executiontime.append(endingtime[i] - startingtime[i])

for i in range(10):
    print(f"Node: {nodes[i]}, Execution Time: {executiontime[i]}")

fig = plt.figure(figsize=(5, 5))
plt.plot(nodes, executiontime, marker='o')
plt.xlabel("Nodes")
plt.ylabel("Execution Time (seconds)")
plt.show()


OUTPUT:


PRACTICAL-04

AIM: - Write a program on an unloaded cluster for several different numbers of nodes and record the time taken in each case.

THEORY:
What is a Cluster?
A cluster is a group of interconnected computers, or nodes, that work together to perform
complex computations efficiently. Each node in the cluster has its own processor, memory,
and storage, and they are connected through a high-speed network. By dividing tasks among
the nodes, clusters enable parallel processing, which significantly speeds up computations.
They are scalable, allowing more nodes to be added for greater computational power, and are
managed by specialized software that allocates resources efficiently. Clusters are commonly
used in research, simulations, and data-intensive applications where high processing power is
essential.

What are the types of clusters?


1. High-Performance Computing (HPC) Clusters: Designed for executing complex
calculations and simulations by using parallel processing across multiple nodes,
commonly used in scientific research and data-intensive tasks.
2. High-Availability (HA) Clusters: Ensure continuous service by automatically
switching to a backup node in case of failure, minimizing downtime and maintaining
system reliability.
3. Load-Balancing Clusters: Distribute incoming workloads across multiple servers to
optimize resource use and improve response times, commonly used in web services.
4. Storage Clusters: Manage large datasets by distributing data across multiple storage
nodes, providing redundancy and high availability, often used in distributed file
systems.
5. Grid Computing Clusters: Combine resources from multiple, often geographically
dispersed, computers to solve a single problem or task, functioning like a virtual
supercomputer.
6. Big Data Clusters: Designed for processing and analyzing massive datasets, utilizing
distributed computing frameworks like Hadoop or Spark for parallel data processing.
7. Beowulf Clusters: A type of HPC cluster built from standard, off-the-shelf hardware
and open-source software, offering a cost-effective solution for research and
educational purposes.


CODE:

from numba import cuda
import numpy as np

@cuda.jit
def add_arrays(a, b, res):
    index = cuda.grid(1)
    if index < a.size:
        res[index] = a[index] + b[index]

# Define input arrays
a = np.array([11, 21, 13, 14, 15], dtype=np.float32)
b = np.array([10, 20, 30, 40, 50], dtype=np.float32)
result = np.empty_like(a)

# Allocate device memory
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_result = cuda.to_device(result)

# Define block and grid sizes
threads_per_block = 256
blocks_per_grid = (a.size + threads_per_block - 1) // threads_per_block

# Launch the kernel
add_arrays[blocks_per_grid, threads_per_block](d_a, d_b, d_result)

# Copy result from device to host
d_result.copy_to_host(result)

print(result)

OUTPUT:


PRACTICAL-05
AIM: Write a program to check task distribution using Gprof.

THEORY:

What is a Profiler?

A profiler can be used to understand code from a timing point of view, with the objective of optimizing it to handle various runtime conditions or various loads. Profiling results can be ingested by a compiler that provides profile-guided optimization.

What is Gprof?

Gprof is a performance analysis tool used to profile applications to determine where time is spent during program execution.


Features of Gprof:

 Customize the output format and style.
 Specify how Gprof analyzes its data.
 Specify debugging/diagnostic output while Gprof performs its work.

Steps to perform Gprof profiling:

1. Compile with Profiling Enabled:

 During compilation, add the -pg flag to your compiler command. This tells the compiler to include profiling information in the executable. For example, if you're using GCC to compile a C program named source.c, you'd use: gcc -pg source.c -o my_program

2. Run the Program:

 Execute your program as you normally would. This will generate profiling data and store it in a file named gmon.out (by default) in the current working directory.

3. Analyze with gprof:


 Once your program finishes running, use the gprof command followed by the name of your program to analyze the profiling data in gmon.out.

For instance: gprof my_program

This will generate a report containing two main sections:

 Flat profile: This shows the total time spent in each function of your program.
 Call graph: This illustrates how functions call each other and how much time
is spent in each call.

Code (Simple):
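The simple example appears only as a screenshot in the original; a minimal sketch of the kind of C program typically profiled here (function and file names are illustrative) is shown below. It would be built and analysed with the commands from the steps above (gcc -pg source.c -o my_program, ./my_program, gprof my_program).

#include <stdio.h>

/* Two functions with different amounts of work, so gprof's flat profile
   shows clearly distinguishable entries. */
long func1(void) {
    long s = 0;
    for (long i = 0; i < 100000000L; i++)
        s += i % 7;
    return s;
}

long func2(void) {
    long s = 0;
    for (long i = 0; i < 300000000L; i++)
        s += i % 7;
    return s;
}

int main(void) {
    long total = func1() + func2();
    printf("total = %ld\n", total);
    return 0;
}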


Output:


Code (Complex):
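The complex example is likewise only a screenshot; a sketch with nested calls (all names illustrative), so that gprof's call graph shows how time propagates from callers to callees:

#include <stdio.h>

/* Leaf function that does the actual work. */
long busy(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i & 15;
    return s;
}

/* Intermediate functions calling the leaf with different workloads, so the
   call graph attributes time to each caller separately. */
long stage_a(void) { return busy(200000000L); }
long stage_b(void) { return busy(100000000L) + stage_a(); }

int main(void) {
    long total = 0;
    for (int round = 0; round < 3; round++)
        total += stage_b();
    printf("total = %ld\n", total);
    return 0;
}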


Output:
