HPC Final 4-8
PRACTICAL – 9
THEORY:
Thread organization: a single kernel is launched from the host (the CPU) and is
executed by many threads on the device (the GPU). Threads are organized into three-
dimensional structures called blocks, which are in turn organized into a three-dimensional
grid. The dimensions of blocks and grids are explicitly defined by the programmer, as the sketch below shows.
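A minimal sketch of defining these dimensions at launch time (myKernel is a placeholder kernel, not from the original listing):
__global__ void myKernel() {
    // blockIdx and threadIdx are the 3-D coordinates supplied by the runtime
}
int main() {
    dim3 block(8, 8, 4);   // 8 x 8 x 4 = 256 threads per block, in three dimensions
    dim3 grid(16, 16, 1);  // 16 x 16 x 1 = 256 blocks in the grid
    myKernel<<<grid, block>>>();
    cudaDeviceSynchronize(); // wait for the kernel to finish
    return 0;
}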
Memory hierarchy: threads can access data in several different memories with
different scopes. Registers and local memory are private to each thread. Shared memory lets
threads belonging to the same block communicate, and has low access latency. All threads
can access the global memory, which suffers from high latency but has been cached since the
introduction of the Fermi architecture. Texture and constant memory can be read from any
thread and are cached as well.
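A minimal sketch of threads in one block communicating through low-latency shared memory (the kernel name and array size are illustrative):
#include <stdio.h>
#define N 64
__global__ void reverseBlock(int *d_in, int *d_out) {
    __shared__ int tile[N];        // visible to every thread of this block
    int t = threadIdx.x;
    tile[t] = d_in[t];             // stage data in shared memory
    __syncthreads();               // wait until all threads have written
    d_out[t] = tile[N - 1 - t];    // read a value written by another thread
}
int main() {
    int h[N], *d_in, *d_out;
    for (int i = 0; i < N; i++) h[i] = i;
    cudaMalloc((void **)&d_in, N * sizeof(int));
    cudaMalloc((void **)&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h, N * sizeof(int), cudaMemcpyHostToDevice);
    reverseBlock<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d\n", h[0]);   // prints 63
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}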
Working of CUDA:
Program Flow:
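A CUDA program typically moves through the following steps:
1. Allocate and initialize the input data in host memory.
2. Allocate device memory with cudaMalloc.
3. Copy the inputs from host to device with cudaMemcpy.
4. Launch the kernel with the <<<grid, block>>> execution configuration.
5. Copy the results back from device to host with cudaMemcpy.
6. Free device memory with cudaFree and host memory with free.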
Output:
PRACTICAL – 10
Grid: The grid is the outermost layer of organization in CUDA. It consists of blocks, and
it represents the entire set of threads that will execute a kernel (a function that runs on the
GPU). Each kernel launch creates a single grid, whose dimensions are available inside the
kernel through the built-in gridDim variable.
Block: Within a grid, threads are grouped into blocks. Blocks are the next level of
organization and represent a group of threads that can cooperate with each other using shared
memory and synchronization primitives. Blocks are uniquely identified by a block index
within the grid.
Warp: A warp is the smallest unit of threads that is scheduled together on a GPU core.
In CUDA, a warp consists of 32 threads on all current NVIDIA architectures. All threads
within a warp execute in lockstep, meaning they execute the same instruction at the same
time; if threads of the same warp take different branches, the branches are serialized (warp
divergence), as the sketch below illustrates.
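A minimal sketch of lockstep execution and divergence within one warp (kernel and variable names are illustrative):
#include <stdio.h>
__global__ void divergent(int *out) {
    int lane = threadIdx.x % 32;       // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1;          // first half-warp takes this branch
    else
        out[threadIdx.x] = 2;          // second half is masked off, runs after
}
int main() {
    int h[32], *d;
    cudaMalloc((void **)&d, sizeof(h));
    divergent<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d ... %d\n", h[0], h[31]); // prints 1 ... 2
    cudaFree(d);
    return 0;
}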
Thread: The thread is the smallest unit of execution in CUDA. Each thread within a
block is assigned a unique thread index that can be used to calculate its position in data arrays
and to determine its behaviour within the kernel.
CODE:
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    // Each thread computes one element; blockIdx/blockDim/threadIdx give its position
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
int main() {
    // The kernel, initialization, copies, and launch are a standard reconstruction;
    // the original listing showed only the allocation and cleanup code below.
    int n = 1024, size = n * sizeof(int);
    int *a, *b, *c, *d_a, *d_b, *d_c;
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);
    return 0;
}
Output:
PRACTICAL – 07
What is Profiler?
Profilers are tools used in software development to analyze and measure the performance of a
program. They help developers identify bottlenecks, memory leaks, and other inefficiencies
in their code. Profilers work by collecting data about various aspects of a program's execution,
such as CPU usage, memory usage, function call frequency, and execution time.
What is NVIDIA-Profiler?
nvprof is a command-line profiling tool that is part of the NVIDIA CUDA Toolkit. It is used
to profile and analyze the performance of CUDA applications running on NVIDIA GPUs.
The tool provides detailed information about the execution of the application, including the
time spent in each kernel, the memory usage, and the data transfers between the host and the
device. nvprof is a powerful tool that can help developers optimize their applications for
better performance and efficiency.
Features of NVIDIA-Profiler:
Measures kernel execution time.
Reports memory transfer sizes and times between host and device.
Provides API call statistics and timings.
Generates detailed reports of GPU hardware events (e.g., warp execution efficiency).
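For example, a CUDA binary can be compiled and profiled from the command line as follows (using the AddIntsCUDA1.cu source from this practical; the flags shown are standard nvprof options):
nvcc AddIntsCUDA1.cu -o AddIntsCUDA1
nvprof ./AddIntsCUDA1
nvprof --print-gpu-trace ./AddIntsCUDA1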
2. Select GPU: Ensure your system has a compatible NVIDIA GPU. Run the command
nvidia-smi to verify GPU availability and details. In Google Colab, open Runtime >
Change runtime type and select the T4 GPU, which supports the CUDA environment.
3. Check CUDA Version: Confirm the CUDA Toolkit version on your system using nvcc
--version or cat /usr/local/cuda/version.txt. This step ensures compatibility with CUDA-
dependent applications.
5. Load CUDA Plugins: If working within an environment like PyTorch, ensure CUDA
support is enabled. This usually involves importing the necessary CUDA modules and
checking for GPU availability using torch.cuda.is_available().
CODE:
%%writefile AddIntsCUDA1.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
// Kernel definition: adds b into a on the device.
// (The kernel body, the copies, and the printout are a standard reconstruction;
// the original listing showed only the declarations and the launch.)
__global__ void AddIntsCUDA(int *a, int *b) {
    a[0] += b[0];
}
int main() {
    int a = 5, b = 9;
    int *d_a, *d_b; // Device variable declaration
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    // Launch Kernel
    AddIntsCUDA<<<1, 1>>>(d_a, d_b);
    // Copy the result back and print it
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    printf("The answer is %d\n", a);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
Output:
PRACTICAL-06
Theory:
What is Intel V-Tune?
Intel VTune Profiler (often referred to simply as Intel V-Tune) is a performance analysis and
profiling tool designed to help developers optimize their software applications for better
performance on Intel processors. It provides detailed insights into how software utilizes
hardware resources, enabling developers to identify and address performance bottlenecks.
Here are the key aspects of Intel VTune.
Flowchart:
Key Features:
Advanced CPU profiling: Analyze CPU usage, identify performance
bottlenecks, and optimize CPU-bound applications.
Advanced performance analysis: Analyze CPU, GPU, and memory usage
to identify performance bottlenecks.
Availability:
Included in the Intel oneAPI Base Toolkit, which is a core set of tools
and libraries for developing high-performance, data-centric applications
across diverse architectures.
Available as a stand-alone download.
Benefits:
Helps developers achieve high application performance on Intel hardware.
Provides detailed insights into system and application performance.
Helps optimize power consumption and identify performance
bottlenecks.
Supports various programming languages, including C, C++, Fortran,
DPC++, OpenMP, and Python.
Use Cases:
Optimizing CPU-bound applications for high performance.
Profiling and optimizing GPU-accelerated applications.
Identifying and fixing memory leaks and optimizing memory-bound
applications.
Analyzing and optimizing system-wide performance.
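As a minimal sketch of how the profiler is driven from the command line (assuming the standalone vtune CLI from the oneAPI Base Toolkit is on the PATH; my_app is a hypothetical binary):
vtune -collect hotspots -- ./my_app
vtune -report summary -r <result-dir>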
Conclusion:
Intel VTune Performance Analyzer is a powerful tool that helps developers optimize
their applications for high performance, power efficiency, and scalability on Intel
hardware. By providing detailed insights into system and application performance, it
helps developers identify and optimize performance bottlenecks, achieve high
application performance, and optimize power consumption.
PRACTICAL-08
THEORY:
What is Load Balancing?
Load balancing refers to the process of distributing workloads evenly across multiple
computing resources, such as processors, nodes, or clusters, to ensure efficient utilization of
resources and to minimize processing time.
CODE:
import time as tm
import matplotlib.pyplot as plt

nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
startingtime = []
endingtime = []
executiontime = []

# All start times are recorded up front, so the measured time for node i
# includes the simulated work of nodes 1..i: the plot shows cumulative
# completion time across the nodes.
for i in range(10):
    startingtime.append(tm.time())
for i in range(10):
    tm.sleep(0.1)  # simulate a unit of work on node i
    endingtime.append(tm.time())
for i in range(10):
    executiontime.append(endingtime[i] - startingtime[i])
for i in range(10):
    print(f"Node: {nodes[i]}, Execution Time: {executiontime[i]}")

fig = plt.figure(figsize=(5, 5))
plt.plot(nodes, executiontime, marker='o')
plt.xlabel("Nodes")
plt.ylabel("Execution Time (seconds)")
plt.show()
OUTPUT:
PRACTICAL-04
CODE:
from numba import cuda  # missing from the original listing
import numpy as np

@cuda.jit
def add_arrays(a, b, res):
    index = cuda.grid(1)  # global 1-D thread index
    if index < a.size:
        res[index] = a[index] + b[index]

# Host setup and launch (omitted in the original listing; sizes are illustrative)
a = np.arange(1024, dtype=np.float32)
b = 2 * a
result = np.zeros_like(a)
add_arrays[4, 256](a, b, result)  # 4 blocks x 256 threads cover all 1024 elements
print(result)
OUTPUT:
PRACTICAL-05
AIM: Write a program to check task distribution using Gprof.
THEORY:
What is a Profiler:
A profiler is used to understand a program from a timing point of view, so that
optimization effort can be directed at the functions where the most time is spent.
What is Gprof:
Gprof is the GNU profiler. It records how much time a program spends in each function
and how often functions call one another. Using it involves three steps:
1. Compile with Profiling Enabled:
Build the program with the -pg flag so that profiling instrumentation is
added, e.g. gcc -pg -o my_program my_program.c
2. Run the Program:
Execute your program as you normally would. This will generate profiling
data and store it in a file named gmon.out (by default) in the current working
directory.
3. Analyze the Profiling Data:
Once your program finishes running, use the gprof command followed by
the name of your program to analyze the profiling data in gmon.out.
Flat profile: This shows the total time spent in each function of your program.
Call graph: This illustrates how functions call each other and how much time
is spent in each call.
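Putting the steps together, a typical gprof session looks like this (using the my_program name from step 1):
gcc -pg -o my_program my_program.c
./my_program
gprof my_program gmon.out > analysis.txt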
Code (Simple):
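A minimal sketch of a simple program suitable for this exercise (function names and loop counts are illustrative):
#include <stdio.h>
/* Two functions with deliberately different costs, so the flat profile
   and call graph have something meaningful to show */
void fast_loop(void) {
    volatile long s = 0;
    for (long i = 0; i < 10000000L; i++) s += i;
}
void slow_loop(void) {
    volatile long s = 0;
    for (long i = 0; i < 100000000L; i++) s += i;
}
int main(void) {
    fast_loop();
    slow_loop();
    printf("done\n");
    return 0;
}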
Output:
Code (Complex):
Output: