Lab 1 Intro To High Performance Computing
Lab 1 Report
Contents
1. Introduction
2. Method
3. Results and Discussion
Appendix
mp1-part1.cu
mp1-part2.cu
mp1-part3.cu
1. Introduction
In this lab, we explored the process of performing parallel computation on GPUs.
This process includes transferring data from the CPU to the GPU, transferring GPU
results back to the CPU, and the fundamentals of setting up kernels as the means of
performing computations on the GPU. The outcomes of the lab demonstrate the
benefit of implementing parallel code on the GPU, as seen in the speed-ups
obtained over the sequential CPU versions, even under less than ideal conditions
for parallel operation on the GPU.
2. Method
The results of the lab were obtained by working remotely on a cluster of 4
compute nodes, each equipped with NVIDIA RTX 2080 Ti GPUs.
3. Results and Discussion
In order to derive a cost (execution time) model for mp1-part1, the code was
modified to run 20 iterations for a given number of elements, and the average
times for device-to-host transfers, host-to-device transfers, the GPU kernel and
the host shift cipher were calculated and used as representative of the GPU and
host times for that number of elements; a sketch of the timing approach is shown
after Table 1. Table 1 below shows the results obtained from this iterative
procedure (the cudaMalloc operations are assumed to take a fixed 10 ms for all
data sizes, since there are two cudaMalloc operations).
Data Size (num_elements)   t_host (ms)   t_HtoD (ms)   t_DtoH (ms)   t_kernel (ms)   t_malloc (ms)
2^23                       56.43         7.47          9.35          0.15            10
2^24                       112.14        15.72         18.72         0.27            10
2^25                       222.91        31.71         38.35         0.53            10
2^26                       447.85        61.26         74.64         1.03            10
2^27                       889.82        123.82        153.4         2.04            10
2^28                       1773.25       253.5         298.68        4.06            10
Table 1: Timing results obtained by averaging 20 iterations for a fixed data size
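The per-phase averages in Table 1 were obtained by timing each phase over the 20 iterations. A minimal sketch of how one such average (the host-to-device copy) could be measured with CUDA events is shown below; the buffer names and num_iters are illustrative placeholders rather than the actual lab code:

// Sketch: averaging the host-to-device transfer time over several iterations
// with CUDA events. The kernel and device-to-host phases are timed the same
// way. Names such as h_in, d_in and num_iters are illustrative placeholders.
#include <cstdio>
#include <cuda_runtime.h>

void profile_h2d(const unsigned int *h_in, unsigned int *d_in,
                 size_t num_elements, int num_iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.f;
    for (int it = 0; it < num_iters; ++it)
    {
        cudaEventRecord(start);
        cudaMemcpy(d_in, h_in, num_elements * sizeof(unsigned int),
                   cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    printf("average host-to-device time: %f ms\n", total_ms / num_iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}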
The GPU execution time t_device can be obtained as:

t_device = t_HtoD + t_DtoH + t_kernel + t_malloc

where t_HtoD is the average host-to-device transfer time, t_DtoH is the average
device-to-host transfer time, t_kernel is the average GPU kernel time and t_malloc
is the fixed time for the cudaMalloc operations. The results from Table 1 are
plotted in Figure 2 below, which clearly shows that the execution time grows
linearly with the number of elements for both the GPU and the CPU.
Figure 2: Execution time plots for GPU and CPU
Calculating the gradients of both lines from the data in Table 1, the execution
times (in seconds) can be modelled as:

t_device = (2.0737 × 10^-9) · num_elements + (10 × 10^-3)

t_host = (6.6020 × 10^-9) · num_elements + (1.05 × 10^-3)

Therefore, the break-even point can be obtained as the number of elements for
which t_device = t_host. Using the equations derived above, this is achieved for

num_elements = ((1.05 × 10^-3) - (10 × 10^-3)) / ((2.0737 × 10^-9) - (6.6020 × 10^-9)) ≈ 1.98 million elements
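As a quick sanity check on this figure, the break-even point can be recomputed directly from the fitted slopes and intercepts; the short C sketch below does just that (the variable names are illustrative):

// Sketch: solving t_device = t_host for the break-even number of elements,
// using the slopes and intercepts fitted from Table 1 (units are seconds).
#include <cstdio>

int main()
{
    const double slope_device = 2.0737e-9, intercept_device = 10e-3;
    const double slope_host   = 6.6020e-9, intercept_host   = 1.05e-3;

    // slope_device * n + intercept_device = slope_host * n + intercept_host
    const double n = (intercept_host - intercept_device)
                   / (slope_device - slope_host);
    printf("break-even at approximately %.2f million elements\n", n / 1e6);
    return 0;
}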
For mp1-part2, we observe the potential for control flow divergence in the
force_eval kernel due to its conditional statements, as shown below:
__global__ void force_eval(
float4* set_A,
float4* set_B,
int* indices,
float4* force_vectors,
int array_length)
{
    // Global thread index: one thread computes the force on one element of set_A.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Guard against out-of-range neighbour indices; this check is the
        // potential source of control flow divergence within a warp.
        if (indices[i] < array_length && indices[i] >= 0)
        {
            force_vectors[i] = force_calc(set_A[i], set_B[indices[i]]);
        }
        else
        {
            force_vectors[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        }
    }
}
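The kernel listing above does not fix the launch configuration. A typical way to launch it, assuming a block size of 256 threads (an assumption; the block size actually used in the lab code is not shown here), would be:

// Sketch: launching force_eval with enough blocks to cover array_length
// elements. The block size of 256 is an assumption, not necessarily the
// value used in the lab code.
void launch_force_eval(float4 *d_set_A, float4 *d_set_B, int *d_indices,
                       float4 *d_force_vectors, int array_length)
{
    const int block_size = 256;
    const int num_blocks = (array_length + block_size - 1) / block_size; // round up
    force_eval<<<num_blocks, block_size>>>(d_set_A, d_set_B, d_indices,
                                           d_force_vectors, array_length);
    cudaDeviceSynchronize();
}

The guard if (i < array_length) inside the kernel is needed precisely because this rounded-up grid may launch more threads than there are elements.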
The device_graph_propagate kernel in mp1-part3 is shown below:
__global__ void device_graph_propagate(
    unsigned int *d_graph_indices,
    unsigned int *d_graph_edges,
    float *d_graph_nodes_in,
    float *d_graph_nodes_out,
    float *d_inv_edges_per_node,
    int array_length)
{
    // Global thread index: one thread updates one graph node.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Accumulate the weighted contributions of all nodes that link to node i.
        float sum = 0.f;
        for (int j = d_graph_indices[i]; j < d_graph_indices[i + 1]; j++)
        {
            sum += d_graph_nodes_in[d_graph_edges[j]] *
                   d_inv_edges_per_node[d_graph_edges[j]];
        }
        d_graph_nodes_out[i] = 0.5f / (float)array_length + 0.5f * sum;
    }
}
From the kernel we can see that, assuming avg_edges links per node, loads are
performed for d_graph_nodes_in[], d_graph_edges[j] (read twice) and
d_inv_edges_per_node[], while stores are performed for d_graph_nodes_out[i].
Therefore, noting that there are 20 iterations per node, we have 20*4*avg_edges
loads and 20*1*avg_edges stores per node. A sketch of the host-side loop that
drives these iterations is shown below.
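The 20 propagation iterations are driven from the host. A minimal sketch of such a driver, assuming the input and output node buffers are swapped (ping-ponged) between iterations and a block size of 256; both are assumptions rather than the exact contents of mp1-part3.cu:

// Sketch: host-side driver for the 20 propagation iterations, ping-ponging
// the node value buffers. NUM_ITERATIONS, block_size and the pointer names
// are illustrative assumptions, not necessarily those used in mp1-part3.cu.
#include <algorithm>   // std::swap

void run_propagation(unsigned int *d_graph_indices, unsigned int *d_graph_edges,
                     float *d_graph_nodes_in, float *d_graph_nodes_out,
                     float *d_inv_edges_per_node, int num_nodes)
{
    const int NUM_ITERATIONS = 20;
    const int block_size = 256;
    const int num_blocks = (num_nodes + block_size - 1) / block_size;

    for (int iter = 0; iter < NUM_ITERATIONS; ++iter)
    {
        device_graph_propagate<<<num_blocks, block_size>>>(
            d_graph_indices, d_graph_edges,
            d_graph_nodes_in, d_graph_nodes_out,
            d_inv_edges_per_node, num_nodes);

        // The output of this iteration becomes the input of the next one.
        std::swap(d_graph_nodes_in, d_graph_nodes_out);
    }
    cudaDeviceSynchronize();
}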
From Figure 4 below, the execution time of the kernel is given as 166 ms. We also
expect num_elements active threads per kernel launch. The average number of bytes
read per node is 20*avg_edges*(2*sizeof(float) + 2*sizeof(unsigned int))
= 20*8*(2*4 bytes + 2*4 bytes) = 2560 bytes, and the average number of bytes
written per node is 20*avg_edges*sizeof(float) = 20*8*4 bytes = 640 bytes.
The effective bandwidth of the kernel can be estimated as:

Effective bandwidth = (avg bytes read + written) × n_nodes / execution time

Effective bandwidth = (2560 + 640) bytes × 2^21 / 166 ms = 40.427 GB/s
The peak memory bandwidth of the RTX 2080 Ti is 616 GB/s. Therefore, compared to
the kernel's effective bandwidth of 40.427 GB/s, the kernel is using only a small
fraction (roughly 6.6%) of the peak bandwidth of the GPU.
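For reference, the effective bandwidth figure above can be reproduced with a few lines of host code; the sketch below simply re-evaluates the arithmetic using the byte counts and the 166 ms timing from the analysis above:

// Sketch: recomputing the effective bandwidth estimate for
// device_graph_propagate from the per-node byte counts derived above.
#include <cstdio>

int main()
{
    const double bytes_read_per_node    = 2560.0;    // 20 * 8 * (2*4 + 2*4) bytes
    const double bytes_written_per_node = 640.0;     // 20 * 8 * 4 bytes
    const double num_nodes              = 2097152.0; // 2^21 nodes
    const double exec_time_s            = 166e-3;    // 166 ms kernel time
    const double peak_bandwidth         = 616e9;     // RTX 2080 Ti peak, bytes/s

    const double total_bytes = (bytes_read_per_node + bytes_written_per_node)
                             * num_nodes;
    const double effective = total_bytes / exec_time_s;   // bytes per second

    printf("effective bandwidth: %.3f GB/s (%.1f%% of peak)\n",
           effective / 1e9, 100.0 * effective / peak_bandwidth);
    return 0;
}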