
Faculty of Engineering & Technology

High Performance Computing Laboratory


(203105430)
B. Tech CSE 4th Year 7th Semester

PRACTICAL – 9

AIM: Write a simple CUDA program to print “Hello World!”

THEORY:

What is CUDA programming?

CUDA stands for Compute Unified Device Architecture. It is an extension of C/C++ programming and a parallel computing platform and API model developed by Nvidia that uses the Graphics Processing Unit (GPU). It allows computations to be performed in parallel, delivering substantial speed-ups. Using CUDA, one can harness the power of an Nvidia GPU for general-purpose computing tasks, such as processing matrices and other linear algebra operations, rather than only performing graphical calculations.

Logical architecture of GPU:

Left side: Thread organization: a single kernel is launched from the host (the CPU) and is executed by multiple threads on the device (the GPU). Threads are organized in three-dimensional structures named blocks, which can be, in turn, organized in three-dimensional grids. The dimensions of blocks and grids are explicitly defined by the programmer.

Right side: Memory hierarchy: threads can access data from many different memories with different scopes. Registers and local memory are private to each thread. Shared memory lets threads belonging to the same block communicate, and has low access latency.


All threads can access the global memory, which suffers from high latency but has been cached since the introduction of the Fermi architecture. Texture and constant memory can be read from any thread and are cached as well.
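As a small illustration of this hierarchy (not part of the original lab sheet; the kernel name and buffer size are arbitrary), the sketch below stages data in a __shared__ buffer so that threads of the same block can exchange values after a synchronization point. It assumes a launch with a single block of at most 256 threads.

__global__ void reverse_in_block(int *data)
{
    __shared__ int buf[256];            // visible to every thread in the block
    int i = threadIdx.x;
    buf[i] = data[i];                   // each thread stages one element
    __syncthreads();                    // wait until the whole block has written
    data[i] = buf[blockDim.x - 1 - i];  // read an element written by another thread
}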

Working of CUDA:

 GPUs run one kernel (a group of tasks) at a time.
 Each kernel consists of blocks, which are independent groups of threads.
 Each block contains threads, the smallest units of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, copying data between the CPU and the GPU is often the most expensive part of the computation.
 For each thread, registers and shared memory are the fastest; global, constant, and texture memory have higher latency, although they are cached.

Program Flow:

 Load data into CPU memory.
 Copy data from CPU to GPU memory – e.g., cudaMemcpy(…, cudaMemcpyHostToDevice).
 Call the GPU kernel using the device variables – e.g., kernel<<<blocks, threads>>>(gpuVar).
 Copy results from GPU to CPU memory – e.g., cudaMemcpy(…, cudaMemcpyDeviceToHost).
 Use the results on the CPU.

 CUDA program to print HELLO WORLD:
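The program itself appears only as a screenshot in the original; the following is a minimal sketch of such a kernel, written from a notebook cell in the same style as the later practicals (the file name hello.cu is illustrative):

%%writefile hello.cu
#include <stdio.h>

// Kernel: the message is printed from the device by every launched thread.
__global__ void hello()
{
    printf("Hello World!\n");
}

int main()
{
    hello<<<1, 1>>>();        // one block containing one thread
    cudaDeviceSynchronize();  // wait for the device so the output is flushed
    return 0;
}

!nvcc hello.cu -o hello
!./hello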


Output:


 With a different number of blocks and threads:
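The corresponding code is also only a screenshot; a sketch of the same kernel launched with, for example, 2 blocks of 4 threads (illustrative values), where each thread reports its block and thread index:

%%writefile hello_blocks.cu
#include <stdio.h>

__global__ void hello()
{
    // Each thread identifies itself by block and thread index.
    printf("Hello World! from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // 2 blocks x 4 threads = 8 prints
    cudaDeviceSynchronize();
    return 0;
}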

Output:


PRACTICAL – 10

AIM: Write a CUDA program to add two arrays.

1. CUDA Programming Basics

GPU Logical Layers:

 Grid: The grid is the outermost layer of organization in CUDA. It consists of blocks, and it represents the entire set of threads that will execute a kernel (a function that runs on the GPU). Its dimensions are specified by the programmer at kernel launch.

 Block: Within a grid, threads are grouped into blocks. Blocks are the next level of
organization and represent a group of threads that can cooperate with each other using shared
memory and synchronization primitives. Blocks are uniquely identified by a block index
within the grid.

 Warp: A warp is the smallest unit of threads that can be scheduled together on a GPU core. In CUDA, a warp typically consists of 32 threads (the exact size can vary with the GPU architecture and is exposed as warpSize). All threads within a warp execute in lockstep, meaning they execute the same instruction at the same time.

 Thread: The thread is the smallest unit of execution in CUDA. Each thread within a
block is assigned a unique thread index that can be used to calculate its position in data arrays
and to determine its behaviour within the kernel.


2. CUDA Program to add two numbers.
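The screenshot of this program is not reproduced here; it is essentially the same two-integer addition that is profiled in Practical 7 (AddIntsCUDA). A minimal sketch along those lines:

#include <stdio.h>
#include <cuda_runtime.h>

// Single-thread kernel: add the two device integers in place.
__global__ void AddInts(int *a, int *b)
{
    *a = *a + *b;
}

int main()
{
    int a = 5, b = 9;
    int *d_a, *d_b;

    // Allocate device memory and copy the inputs to the GPU.
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch one block with one thread, then copy the sum back.
    AddInts<<<1, 1>>>(d_a, d_b);
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("The answer is %d\n", a);

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}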


3. CUDA program to add two arrays.


%%writefile array.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(int *a, int *b, int *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    int n = 10;
    int *a, *b, *c;       // host arrays
    int *d_a, *d_b, *d_c; // device arrays
    int size = n * sizeof(int);

    // Allocate and initialise host memory.
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // Allocate device memory.
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy the input arrays from host to device.
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch one block of n threads.
    add<<<1, n>>>(d_a, d_b, d_c, n);

    // Copy the result back to the host and print it.
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    // Free device and host memory.
    cudaFree(d_a);
    cudaFree(d_b);


    cudaFree(d_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

!nvcc array.cu -o array


!./array

Output:


PRACTICAL – 07

AIM: Analyze the code using NVIDIA profilers.

What is a Profiler?
Profilers are tools used in software development to analyze and measure the performance of a
program. They help developers identify bottlenecks, memory leaks, and other inefficiencies
in their code. Profilers work by collecting data about various aspects of a program's execution,
such as CPU usage, memory usage, function call frequency, and execution time.

What is the NVIDIA Profiler?
nvprof is a command-line profiling tool that is part of the NVIDIA CUDA Toolkit. It is used to profile and analyze the performance of CUDA applications running on NVIDIA GPUs. The tool provides detailed information about the execution of the application, including the time spent in each kernel, the memory usage, and the data transfers between the host and the device. nvprof is a powerful tool that can help developers optimize their applications for better performance and efficiency.

Features of the NVIDIA Profiler:
 Measures kernel execution time.
 Reports memory transfer sizes and times between host and device.
 Provides API call statistics and timings.
 Generates detailed reports of GPU hardware events (e.g., warp execution efficiency).

Steps to perform profiling with the NVIDIA Profiler:

1. Run a CUDA-Enabled Environment: Launch your development environment, ensuring it supports CUDA, for example Google Colab, in which multiple languages can be run.


2. Select GPU: Ensure your system has a compatible NVIDIA GPU. Run the command nvidia-smi to verify GPU availability and details. In Colab, change the runtime type and select the T4 GPU, which supports the CUDA environment.

3. Check CUDA Version: Confirm the CUDA Toolkit version on your system using nvcc
--version or cat /usr/local/cuda/version.txt. This step ensures compatibility with CUDA-
dependent applications.

4. Pip Install CUDA-Supported Libraries: Use pip to install CUDA-supported Python libraries, such as TensorFlow or PyTorch. Example: pip install tensorflow-gpu. The nvcc4jupyter notebook plugin can be installed with the command ‘!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git‘


5. Load CUDA Plugins: If working within an environment like PyTorch, ensure CUDA
support is enabled. This usually involves importing the necessary CUDA modules and
checking for GPU availability using torch.cuda.is_available().

CODE:

%%writefile AddIntsCUDA1.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

__global__ void AddIntsCUDA(int *a, int *b) {
    *a = *a + *b;
}

int main() {
    int a = 5, b = 9;
    int *d_a, *d_b; // Device variable declaration

    // Allocation of Device Variables
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));

    // Copy Host Memory to Device Memory
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch Kernel
    AddIntsCUDA<<<1, 1>>>(d_a, d_b);

    // Copy Device Memory to Host Memory
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("The answer is %d\n", a);

    // Free Device Memory
    cudaFree(d_a);
    cudaFree(d_b);

    return 0;
}
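The lab sheet does not show the compile and profiling commands themselves; in a Colab-style notebook they would typically be (binary name assumed from the cell above):

!nvcc AddIntsCUDA1.cu -o AddIntsCUDA1
!nvprof ./AddIntsCUDA1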

Output:


PRACTICAL-06

AIM: Use Intel V-Tune Performance Analyzer for Profiling.

Theory:
What is Intel V-Tune?
Intel VTune Profiler (often referred to simply as Intel V-Tune) is a performance analysis and
profiling tool designed to help developers optimize their software applications for better
performance on Intel processors. It provides detailed insights into how software utilizes
hardware resources, enabling developers to identify and address performance bottlenecks.
Here are the key aspects of Intel VTune.

Flowchart:

1. Intel V-Tune Profiler:


 Overview:
 Intel VTune Performance Profiler is a performance analysis and
optimization tool that helps developers achieve high application
performance on Intel hardware.

 It provides detailed insights into system and application performance, helping developers identify and optimize performance bottlenecks.

 Key Features:
 Advanced CPU profiling: Analyze CPU usage, identify performance
bottlenecks, and optimize CPU-bound applications.


 GPU profiling: Profile GPU usage, identify performance bottlenecks, and optimize GPU-accelerated applications.
 Memory profiling: Analyze memory usage, identify memory leaks, and optimize memory-bound applications.
 Threading profiling: Analyze thread execution, identify synchronization issues, and optimize multi-threaded applications.

 Availability:
 Included in the Intel oneAPI Base Toolkit, which is a core set of tools
and libraries for developing high-performance, data-centric applications
across diverse architectures.
 Available as a stand-alone download.

 Benefits:
 Helps developers achieve high application performance on Intel hardware.
 Provides detailed insights into system and application performance.
 Helps optimize power consumption and identify performance
bottlenecks.
 Supports various programming languages, including C, C++, Fortran,
DPC++, OpenMP, and Python.

 Use Cases:
 Optimizing CPU-bound applications for high performance.
 Profiling and optimizing GPU-accelerated applications.
 Identifying and fixing memory leaks and optimizing memory-bound applications.
 Analyzing and optimizing system-wide performance.

2. Intel V-Tune Performance Analyzer:

 Overview:
 Intel VTune Performance Analyzer is a performance analysis and
optimization tool that helps developers identify and optimize
performance bottlenecks in their applications.
 It provides detailed insights into system and application performance,
helping developers achieve high performance and power efficiency.

 Key Features:
 Advanced performance analysis: Analyze CPU, GPU, and memory usage
to identify performance bottlenecks.


 Hotspot analysis: Identify the most time-consuming functions and loops in the application.
 Call graph analysis: Visualize the call graph to understand the application's execution flow.

 Benefits:
 Helps developers identify and optimize performance bottlenecks in their
applications.
 Helps achieve high application performance, power efficiency, and
scalability on Intel hardware.

 Use Cases:
 Identifying and optimizing performance bottlenecks in CPU-bound
applications.
 Analyzing and optimizing GPU-accelerated applications.
 Optimizing memory-bound applications.
 Developing high-performance, data-centric applications on Intel
hardware.

3. Conclusion:
Intel VTune Performance Analyzer is a powerful tool that helps developers optimize
their applications for high performance, power efficiency, and scalability on Intel
hardware. By providing detailed insights into system and application performance, it
helps developers identify and optimize performance bottlenecks, achieve high
application performance, and optimize power consumption.
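The lab sheet does not include a concrete invocation; as an illustrative example using the VTune command-line interface (the application name my_program and result directory r001 are assumed), a hotspots analysis can be collected and summarised with:

vtune -collect hotspots -result-dir r001 -- ./my_program
vtune -report summary -result-dir r001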


PRACTICAL-08

AIM: - Write a program to perform load distribution on GPU using CUDA.

THEORY:-
What is Load Balancing?
Load balancing refers to the process of distributing workloads evenly across multiple
computing resources, such as processors, nodes, or clusters, to ensure efficient utilization of
resources and to minimize processing time.

Why Load Balancing is important:


Load balancing is essential because it optimizes the use of computational resources, ensuring
that workloads are evenly distributed across available systems. This prevents some resources
from becoming overburdened while others remain idle, which in turn maximizes efficiency
and reduces processing time. By balancing the load, systems can handle more tasks
simultaneously, improving overall performance and throughput. Additionally, load balancing
supports scalability, allowing systems to maintain performance as the number of tasks or data
size increases. It also enhances reliability and fault tolerance by redistributing tasks from
failing or overloaded components, ensuring continuous operation and minimizing downtime.

What are the different load balancing techniques?

 There are several types of load balancing techniques (a minimal CUDA sketch of static cyclic distribution is given after this list). Some of them are:
 Static
 Dynamic
 Static technique is further divided into 3 types:
 Based on Data Partitioning:
o Array – Simple & Block Array
o Cyclic & Block
o Randomized
 Based on Task Partitioning
 Hierarchical
 Dynamic technique is further divided into 2 types:
 Centralized
o Master & Slave Technique
 Distributed
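As a minimal sketch of the static cyclic (round-robin) approach on a GPU (not taken from the lab sheet; the kernel name and workload are illustrative), a grid-stride loop lets a fixed number of threads share n elements evenly:

// Each thread handles elements i, i + stride, i + 2*stride, ..., so the work
// is distributed cyclically across all launched threads.
__global__ void scale_cyclic(float *data, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int stride = gridDim.x * blockDim.x;                 // total thread count
    for (; i < n; i += stride)
        data[i] = 2.0f * data[i];
}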


Advantages of Load Balancing:


 Optimized Resource Utilization: Load balancing ensures that all available resources
are used efficiently, preventing any single resource from becoming a bottleneck and
enhancing overall system performance.
 Reduced Latency: By distributing workloads evenly, load balancing minimizes
processing delays, leading to faster response times and quicker task completion.
 Adaptability: Dynamic load balancing techniques can adjust to changes in workload or
resource availability, making it easier to scale up or down as needed.
 Cost Efficiency: By maximizing resource utilization, load balancing helps in reducing
operational costs, as fewer resources are wasted and the need for additional hardware is
minimized.

CODE:

import time as tm
import matplotlib.pyplot as plt

nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
startingtime = []
endingtime = []
executiontime = []

# Record a start timestamp for each node.
for i in range(10):
    startingtime.append(tm.time())

# Simulate a task on each node and record its end timestamp.
for i in range(10):
    tm.sleep(0.1)
    endingtime.append(tm.time())

# Compute the elapsed time per node.
for i in range(10):
    executiontime.append(endingtime[i] - startingtime[i])

for i in range(10):
    print(f"Node: {nodes[i]}, Execution Time: {executiontime[i]}")

fig = plt.figure(figsize=(5, 5))
plt.plot(nodes, executiontime, marker='o')
plt.xlabel("Nodes")
plt.ylabel("Execution Time (seconds)")
plt.show()


OUTPUT:


PRACTICAL-04

AIM: - Write a program on an unloaded cluster for several different numbers of nodes and record the time taken in each case.

THEORY:
What is a Cluster?
A cluster is a group of interconnected computers, or nodes, that work together to perform
complex computations efficiently. Each node in the cluster has its own processor, memory,
and storage, and they are connected through a high-speed network. By dividing tasks among
the nodes, clusters enable parallel processing, which significantly speeds up computations.
They are scalable, allowing more nodes to be added for greater computational power, and are
managed by specialized software that allocates resources efficiently. Clusters are commonly
used in research, simulations, and data-intensive applications where high processing power is
essential.

What are the types of clusters?


1. High-Performance Computing (HPC) Clusters: Designed for executing complex
calculations and simulations by using parallel processing across multiple nodes,
commonly used in scientific research and data-intensive tasks.
2. High-Availability (HA) Clusters: Ensure continuous service by automatically
switching to a backup node in case of failure, minimizing downtime and maintaining
system reliability.
3. Load-Balancing Clusters: Distribute incoming workloads across multiple servers to
optimize resource use and improve response times, commonly used in web services.
4. Storage Clusters: Manage large datasets by distributing data across multiple storage
nodes, providing redundancy and high availability, often used in distributed file
systems.
5. Grid Computing Clusters: Combine resources from multiple, often geographically
dispersed, computers to solve a single problem or task, functioning like a virtual
supercomputer.
6. Big Data Clusters: Designed for processing and analyzing massive datasets, utilizing
distributed computing frameworks like Hadoop or Spark for parallel data processing.
7. Beowulf Clusters: A type of HPC cluster built from standard, off-the-shelf hardware
and open-source software, offering a cost-effective solution for research and
educational purposes.


CODE:

from numba import cuda
import numpy as np

@cuda.jit
def add_arrays(a, b, res):
    index = cuda.grid(1)
    if index < a.size:
        res[index] = a[index] + b[index]

# Define input arrays
a = np.array([11, 21, 13, 14, 15], dtype=np.float32)
b = np.array([10, 20, 30, 40, 50], dtype=np.float32)
result = np.empty_like(a)

# Allocate device memory
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_result = cuda.to_device(result)

# Define block and grid sizes
threads_per_block = 256
blocks_per_grid = (a.size + threads_per_block - 1) // threads_per_block

# Launch the kernel
add_arrays[blocks_per_grid, threads_per_block](d_a, d_b, d_result)

# Copy result from device to host
d_result.copy_to_host(result)

print(result)

OUTPUT:


PRACTICAL-05
AIM: Write a program to check task distribution using Gprof.

THEORY:

What is a Profiler?

A profiler can be used to understand code from a timing point of view, with the objective of optimizing it to handle various runtime conditions or various loads. Profiling results can be ingested by a compiler that provides profile-guided optimization.

What is Gprof?

Gprof is a performance analysis tool used to profile applications to determine where time is spent during program execution.


Features of Gprof:

 Customize the output format and style.
 Specify how Gprof analyzes its data.
 Specify debugging/diagnostic output while Gprof performs its work.

Steps to perform Gprof profiling:

1. Compile with Profiling Enabled:

 During compilation, add the -pg flag to your compiler command. This tells the compiler to include profiling information in the executable. For example, if you're using GCC to compile a C program named source.c, you'd use: gcc -pg source.c -o my_program

2. Run the Program:

 Execute your program as you normally would. This will generate profiling data and store it in a file named gmon.out (by default) in the current working directory.

3. Analyze with gprof:


 Once your program finishes running, use the gprof command followed by the name of your program to analyze the profiling data in gmon.out.

For instance: gprof my_program

This will generate a report containing two main sections:

 Flat profile: This shows the total time spent in each function of your program.
 Call graph: This illustrates how functions call each other and how much time
is spent in each call.

Code (Simple):
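The simple example appears only as a screenshot in the original; a minimal sketch of the kind of C program typically profiled here (function and file names are illustrative) is shown below. It would be built and analysed with the commands from the steps above (gcc -pg source.c -o my_program, ./my_program, gprof my_program).

#include <stdio.h>

/* Two functions with different amounts of work, so gprof's flat profile
   shows clearly distinguishable entries. */
long func1(void) {
    long s = 0;
    for (long i = 0; i < 100000000L; i++)
        s += i % 7;
    return s;
}

long func2(void) {
    long s = 0;
    for (long i = 0; i < 300000000L; i++)
        s += i % 7;
    return s;
}

int main(void) {
    long total = func1() + func2();
    printf("total = %ld\n", total);
    return 0;
}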


Output:


Code (Complex):
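The complex example is likewise only a screenshot; a sketch with nested calls (all names illustrative), so that gprof's call graph shows how time propagates from callers to callees:

#include <stdio.h>

/* Leaf function that does the actual work. */
long busy(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i & 15;
    return s;
}

/* Intermediate functions calling the leaf with different workloads, so the
   call graph attributes time to each caller separately. */
long stage_a(void) { return busy(200000000L); }
long stage_b(void) { return busy(100000000L) + stage_a(); }

int main(void) {
    long total = 0;
    for (int round = 0; round < 3; round++)
        total += stage_b();
    printf("total = %ld\n", total);
    return 0;
}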


Output:
