Lab 1 Intro To High Performance Computing
Lab 1 Report
Contents
1. Introduction
2. Method
3. Results and Discussion
Appendix
mp1-part1.cu
mp1-part2.cu
mp1-part3.cu
1. Introduction
In this lab, we explored the process of performing parallel computation on GPUs.
This process includes transferring data from the CPU to the GPU, transferring GPU
results back to the CPU, and the fundamentals of setting up kernels as the means of
performing computations on the GPU. The outcomes of the lab demonstrate the
benefit of implementing parallel code on the GPU, as seen in the speed-ups
obtained over the sequential CPU versions, even under less than ideal conditions
for parallel operation on the GPU.
2. Method
The results of the lab were obtained by working remotely on a cluster of 4
compute nodes, each equipped with NVIDIA RTX 2080 Ti GPUs.
3. Results and Discussion
In order to derive a cost (execution time) model for mp1-part1, the code was
modified to run 20 iterations for a given number of elements, and the average
times for device-to-host transfers, host-to-device transfers, the GPU kernel and
the host shift cipher were calculated and used as representative of the GPU and
host times for that number of elements; a sketch of the timing approach is shown
after Table 1. Table 1 below shows the results obtained from this iterative
procedure (the cudaMalloc operations are assumed to take a fixed 10 ms for all
data sizes, since there are two cudaMalloc operations).
Data Size (num_elements)   t_host (ms)   t_HtoD (ms)   t_DtoH (ms)   t_kernel (ms)   t_malloc (ms)
2^23                       56.43         7.47          9.35          0.15            10
2^24                       112.14        15.72         18.72         0.27            10
2^25                       222.91        31.71         38.35         0.53            10
2^26                       447.85        61.26         74.64         1.03            10
2^27                       889.82        123.82        153.4         2.04            10
2^28                       1773.25       253.5         298.68        4.06            10
Table 1: Timing results obtained by averaging 20 iterations for a fixed data size
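The per-phase averages in Table 1 were obtained by timing each phase over the 20 iterations. A minimal sketch of how one such average (the host-to-device copy) could be measured with CUDA events is shown below; the buffer names and num_iters are illustrative placeholders rather than the actual lab code:

// Sketch: averaging the host-to-device transfer time over several iterations
// with CUDA events. The kernel and device-to-host phases are timed the same
// way. Names such as h_in, d_in and num_iters are illustrative placeholders.
#include <cstdio>
#include <cuda_runtime.h>

void profile_h2d(const unsigned int *h_in, unsigned int *d_in,
                 size_t num_elements, int num_iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.f;
    for (int it = 0; it < num_iters; ++it)
    {
        cudaEventRecord(start);
        cudaMemcpy(d_in, h_in, num_elements * sizeof(unsigned int),
                   cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    printf("average host-to-device time: %f ms\n", total_ms / num_iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}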
The GPU execution time t_device can be obtained as:

t_device = t_HtoD + t_DtoH + t_kernel + t_malloc

where t_HtoD is the average host-to-device transfer time, t_DtoH is the average
device-to-host transfer time, t_kernel is the average GPU kernel time and t_malloc
is the fixed time for the cudaMalloc operations. The results from Table 1 are
plotted in Figure 2 below, which clearly shows that the execution time grows
linearly with the number of elements for both the GPU and the CPU.
Figure 2: Execution time plots for GPU and CPU
Calculating the gradients of both lines from the data in Table 1, the execution
times (in seconds) can be modelled as:

t_device = (2.0737 × 10^-9) · num_elements + (10 × 10^-3)

t_host = (6.6020 × 10^-9) · num_elements + (1.05 × 10^-3)

Therefore, the break-even point can be obtained as the number of elements for
which t_device = t_host. Using the equations derived above, this is achieved for

num_elements = ((1.05 × 10^-3) - (10 × 10^-3)) / ((2.0737 × 10^-9) - (6.6020 × 10^-9)) ≈ 1.98 million elements
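As a quick sanity check on this figure, the break-even point can be recomputed directly from the fitted slopes and intercepts; the short C sketch below does just that (the variable names are illustrative):

// Sketch: solving t_device = t_host for the break-even number of elements,
// using the slopes and intercepts fitted from Table 1 (units are seconds).
#include <cstdio>

int main()
{
    const double slope_device = 2.0737e-9, intercept_device = 10e-3;
    const double slope_host   = 6.6020e-9, intercept_host   = 1.05e-3;

    // slope_device * n + intercept_device = slope_host * n + intercept_host
    const double n = (intercept_host - intercept_device)
                   / (slope_device - slope_host);
    printf("break-even at approximately %.2f million elements\n", n / 1e6);
    return 0;
}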
For mp1-part2, we observe the potential for control flow divergence in the
force_eval kernel due to its conditional statements, as shown below:
__global__ void force_eval(
float4* set_A,
float4* set_B,
int* indices,
float4* force_vectors,
int array_length)
{
    // Global thread index: one thread computes the force on one element of set_A.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Guard against out-of-range neighbour indices; this check is the
        // potential source of control flow divergence within a warp.
        if (indices[i] < array_length && indices[i] >= 0)
        {
            force_vectors[i] = force_calc(set_A[i], set_B[indices[i]]);
        }
        else
        {
            force_vectors[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        }
    }
}
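The kernel listing above does not fix the launch configuration. A typical way to launch it, assuming a block size of 256 threads (an assumption; the block size actually used in the lab code is not shown here), would be:

// Sketch: launching force_eval with enough blocks to cover array_length
// elements. The block size of 256 is an assumption, not necessarily the
// value used in the lab code.
void launch_force_eval(float4 *d_set_A, float4 *d_set_B, int *d_indices,
                       float4 *d_force_vectors, int array_length)
{
    const int block_size = 256;
    const int num_blocks = (array_length + block_size - 1) / block_size; // round up
    force_eval<<<num_blocks, block_size>>>(d_set_A, d_set_B, d_indices,
                                           d_force_vectors, array_length);
    cudaDeviceSynchronize();
}

The guard if (i < array_length) inside the kernel is needed precisely because this rounded-up grid may launch more threads than there are elements.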
The device_graph_propagate kernel in mp1-part3 is shown below:
__global__ void device_graph_propagate(
    unsigned int *d_graph_indices,
    unsigned int *d_graph_edges,
    float *d_graph_nodes_in,
    float *d_graph_nodes_out,
    float *d_inv_edges_per_node,
    int array_length)
{
    // Global thread index: one thread updates one graph node.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Accumulate the weighted contributions of all nodes that link to node i.
        float sum = 0.f;
        for (int j = d_graph_indices[i]; j < d_graph_indices[i + 1]; j++)
        {
            sum += d_graph_nodes_in[d_graph_edges[j]] *
                   d_inv_edges_per_node[d_graph_edges[j]];
        }
        d_graph_nodes_out[i] = 0.5f / (float)array_length + 0.5f * sum;
    }
}
From the kernel we can see that, assuming avg_edges links per node, loads are
performed for d_graph_nodes_in[], d_graph_edges[j] (read twice) and
d_inv_edges_per_node[], while stores are performed for d_graph_nodes_out[i].
Therefore, noting that there are 20 iterations per node, we have 20*4*avg_edges
loads and 20*1*avg_edges stores per node. A sketch of the host-side loop that
drives these iterations is shown below.
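The 20 propagation iterations are driven from the host. A minimal sketch of such a driver, assuming the input and output node buffers are swapped (ping-ponged) between iterations and a block size of 256; both are assumptions rather than the exact contents of mp1-part3.cu:

// Sketch: host-side driver for the 20 propagation iterations, ping-ponging
// the node value buffers. NUM_ITERATIONS, block_size and the pointer names
// are illustrative assumptions, not necessarily those used in mp1-part3.cu.
#include <algorithm>   // std::swap

void run_propagation(unsigned int *d_graph_indices, unsigned int *d_graph_edges,
                     float *d_graph_nodes_in, float *d_graph_nodes_out,
                     float *d_inv_edges_per_node, int num_nodes)
{
    const int NUM_ITERATIONS = 20;
    const int block_size = 256;
    const int num_blocks = (num_nodes + block_size - 1) / block_size;

    for (int iter = 0; iter < NUM_ITERATIONS; ++iter)
    {
        device_graph_propagate<<<num_blocks, block_size>>>(
            d_graph_indices, d_graph_edges,
            d_graph_nodes_in, d_graph_nodes_out,
            d_inv_edges_per_node, num_nodes);

        // The output of this iteration becomes the input of the next one.
        std::swap(d_graph_nodes_in, d_graph_nodes_out);
    }
    cudaDeviceSynchronize();
}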
From Figure 4 below, the execution time of the kernel is given as 166 ms. We also
expect num_elements active threads per kernel launch. The average number of bytes
read per node is 20*avg_edges*(2*sizeof(float) + 2*sizeof(unsigned int))
= 20*8*(2*4 bytes + 2*4 bytes) = 2560 bytes, and the average number of bytes
written per node is 20*avg_edges*sizeof(float) = 20*8*4 bytes = 640 bytes.
The effective bandwidth of the kernel can be estimated as:

Effective bandwidth = (avg bytes read + written) × n_nodes / execution time

Effective bandwidth = (2560 + 640) bytes × 2^21 / 166 ms = 40.427 GB/s
The peak memory bandwidth of the RTX 2080 Ti is 616 GB/s. Therefore, compared to
the kernel's effective bandwidth of 40.427 GB/s, the kernel is using only a small
fraction (roughly 6.6%) of the peak bandwidth of the GPU.
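For reference, the effective bandwidth figure above can be reproduced with a few lines of host code; the sketch below simply re-evaluates the arithmetic using the byte counts and the 166 ms timing from the analysis above:

// Sketch: recomputing the effective bandwidth estimate for
// device_graph_propagate from the per-node byte counts derived above.
#include <cstdio>

int main()
{
    const double bytes_read_per_node    = 2560.0;    // 20 * 8 * (2*4 + 2*4) bytes
    const double bytes_written_per_node = 640.0;     // 20 * 8 * 4 bytes
    const double num_nodes              = 2097152.0; // 2^21 nodes
    const double exec_time_s            = 166e-3;    // 166 ms kernel time
    const double peak_bandwidth         = 616e9;     // RTX 2080 Ti peak, bytes/s

    const double total_bytes = (bytes_read_per_node + bytes_written_per_node)
                             * num_nodes;
    const double effective = total_bytes / exec_time_s;   // bytes per second

    printf("effective bandwidth: %.3f GB/s (%.1f%% of peak)\n",
           effective / 1e9, 100.0 * effective / peak_bandwidth);
    return 0;
}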