
High Performance Computing

Lab 1 Report
Contents

1. Introduction
2. Method
3. Results and Discussion
Appendix
mp1-part1.cu
mp1-part2.cu
mp1-part3.cu

1. Introduction
In this lab, we explored the process of performing parallel computation with GPUs. This includes transferring data from the CPU to the GPU, transferring results from the GPU back to the CPU, and the fundamentals of setting up kernels as the means of performing computations on the GPU. The outcomes of the lab show the clear benefit of implementing parallel code on the GPU: large speed-ups were obtained over the sequential CPU versions, even under conditions that are less than ideal for parallel execution on the GPU.

2. Method
The results of the lab were obtained by working remotely on a cluster of 4 compute nodes, each equipped with NVIDIA RTX 2080Ti GPUs.
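
As a sanity check on the hardware, the device properties can be queried directly from the CUDA runtime. The following is a minimal sketch (not part of the lab code) that prints the device name and an estimate of its peak memory bandwidth, which is referred to again in Section 3:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is reported in kHz and memoryBusWidth in bits;
    // the factor of 2 accounts for the double data rate of the memory.
    double peak_gb_s = 2.0 * (prop.memoryClockRate * 1e3)
                     * (prop.memoryBusWidth / 8.0) / 1e9;

    printf("%s: peak memory bandwidth ~%.0f GB/s\n", prop.name, peak_gb_s);
    return 0;
}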

3. Results and Discussion


In mp1-part1, we implement a parallel version of a shift cypher. The results of the parallel operation are shown in Figure 1 below.

Figure 1: Result of Shift Cypher in mp1-part1


In the instance that produced the result above, we see a speedup of approximately 500× for the shift cypher implementation on the GPU.
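
The full kernel is listed in the appendix (mp1-part1.cu); conceptually, each thread shifts one element independently. A minimal sketch of such a kernel is shown below (the element type, shift convention and parameter names are illustrative, not the exact lab signature):

// Illustrative shift-cypher kernel: one thread per element and no
// inter-thread communication, so the problem is trivially parallel.
__global__ void shift_cypher(const unsigned int *input,
                             unsigned int *output,
                             unsigned int shift_amount,
                             unsigned int alphabet_size,
                             int num_elements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_elements)
    {
        output[i] = (input[i] + shift_amount) % alphabet_size;
    }
}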

In order to derive a cost (execution time) model for mp1-part1, the code was modified to run 20 iterations for a given number of elements. The average device-to-host transfer time, host-to-device transfer time, GPU kernel time and host shift cypher time were then calculated and used as representative GPU and host times for that number of elements. Table 1 below shows the results obtained from this procedure. (The cudaMalloc time is assumed to be fixed at 10 ms for all data sizes, covering the two cudaMalloc operations.)
Data Size (num_elements)    t_host (ms)    t_HtoD (ms)    t_DtoH (ms)    t_kernel (ms)    t_malloc (ms)
2^23                        56.43          7.47           9.35           0.15             10
2^24                        112.14         15.72          18.72          0.27             10
2^25                        222.91         31.71          38.35          0.53             10
2^26                        447.85         61.26          74.64          1.03             10
2^27                        889.82         123.82         153.4          2.04             10
2^28                        1773.25        253.5          298.68         4.06             10
Table 1: Timing results obtained by averaging 20 iterations for a fixed data size
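
The averaging procedure described above can be implemented with CUDA events; the sketch below shows the general pattern for the kernel time (the launch parameters and the commented-out kernel call are placeholders, not the exact mp1-part1.cu code):

// Average the kernel execution time over a number of iterations,
// using CUDA events for GPU-side timing.
float average_kernel_time_ms(int num_iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.0f;
    for (int iter = 0; iter < num_iterations; ++iter)
    {
        cudaEventRecord(start);
        // shift_cypher<<<grid_size, block_size>>>(...);  // kernel under test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float elapsed_ms = 0.0f;
        cudaEventElapsedTime(&elapsed_ms, start, stop);
        total_ms += elapsed_ms;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / num_iterations;  // e.g. num_iterations = 20 as above
}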
The GPU execution time t_device can be obtained as:

t_device = t_HtoD + t_DtoH + t_kernel + t_malloc

where t_HtoD is the average host-to-device transfer time, t_DtoH is the average device-to-host transfer time, t_kernel is the average GPU kernel time and t_malloc is the fixed time for the cudaMalloc operations. The results from Table 1 are plotted in Figure 2 below, which clearly shows that the execution time grows linearly with the number of elements for both the GPU and the CPU.

Figure 2: Execution time plots for GPU and CPU
Calculating the gradients of both lines from the data in Table 1, the execution times (in seconds) can be modelled as:

t_device = (2.0737 × 10^-9) × num_elements + (10 × 10^-3)

t_host = (6.6020 × 10^-9) × num_elements + (1.05 × 10^-3)

Therefore, the break-even point can be obtained as the number of elements for which t_device = t_host. Using the equations derived above, this is achieved for

num_elements = ((1.05 × 10^-3) - (10 × 10^-3)) / ((2.0737 × 10^-9) - (6.6020 × 10^-9)) ≈ 1.98 million elements
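
The break-even point can also be checked numerically from the fitted coefficients; a small sketch using the values of the model above:

#include <cstdio>

int main()
{
    // Fitted slopes (seconds per element) and intercepts (seconds) from above.
    const double slope_device = 2.0737e-9, intercept_device = 10e-3;
    const double slope_host   = 6.6020e-9, intercept_host   = 1.05e-3;

    // Solve slope_device*n + intercept_device = slope_host*n + intercept_host.
    double n_break = (intercept_host - intercept_device)
                   / (slope_device - slope_host);

    printf("break-even at approximately %.2f million elements\n", n_break / 1e6);
    return 0;
}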

For mp1-part2, we observe the potential for control flow divergence in the force_eval kernel due to its conditional statements, as shown below:

4
__global__ void force_eval(
    float4* set_A,
    float4* set_B,
    int* indices,
    float4* force_vectors,
    int array_length)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Threads with a valid index compute a force; the rest write a zero
        // vector. The two branches diverge when they mix within a warp.
        if (indices[i] < array_length && indices[i] >= 0)
        {
            force_vectors[i] = force_calc(set_A[i], set_B[indices[i]]);
        }
        else
        {
            force_vectors[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        }
    }
}
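
The kernel above is launched with a one-dimensional grid sized to cover array_length elements; a sketch of a typical launch follows (the block size of 256 and the device pointer names are illustrative, not the lab's exact values):

// Round the grid size up so that every element is covered by a thread.
const int block_size = 256;
const int grid_size  = (array_length + block_size - 1) / block_size;

force_eval<<<grid_size, block_size>>>(d_set_A, d_set_B, d_indices,
                                      d_force_vectors, array_length);
cudaDeviceSynchronize();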

Nevertheless, the timing results shown in Figure 3 indicate a kernel speedup of approximately 385× over the host function.

Figure 3: Result of Force Evaluation in mp1-part2

The device_graph_propagate kernel in mp1-part3 is shown below:
__global__ void device_graph_propagate(
    unsigned int *d_graph_indices,
    unsigned int *d_graph_edges,
    float *d_graph_nodes_in,
    float *d_graph_nodes_out,
    float *d_inv_edges_per_node,
    int array_length)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_length)
    {
        // Accumulate the contributions of all incoming edges of node i.
        float sum = 0.f;
        for (int j = d_graph_indices[i]; j < d_graph_indices[i + 1]; j++)
        {
            sum += d_graph_nodes_in[d_graph_edges[j]] *
                   d_inv_edges_per_node[d_graph_edges[j]];
        }
        d_graph_nodes_out[i] = 0.5f / (float)array_length + 0.5f * sum;
    }
}
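
The 20 iterations referred to in the analysis below are driven from the host, with the output buffer of one iteration becoming the input of the next. A sketch of that loop (buffer names and launch parameters are illustrative, not the exact lab code):

// Host-side driver: each launch propagates values one step through the graph.
for (int iter = 0; iter < 20; ++iter)
{
    device_graph_propagate<<<grid_size, block_size>>>(
        d_graph_indices, d_graph_edges,
        d_buffer_in, d_buffer_out,
        d_inv_edges_per_node, num_nodes);

    // The output of this iteration is the input of the next one.
    float *tmp   = d_buffer_in;
    d_buffer_in  = d_buffer_out;
    d_buffer_out = tmp;
}
cudaDeviceSynchronize();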

From the kernel we can see that, assuming an average of avg_edges links per node, loads are performed for d_graph_nodes_in[], d_graph_edges[j] (loaded twice) and d_inv_edges_per_node[], while stores are performed for d_graph_nodes_out[i]. Therefore, noting that there are 20 iterations per node, we have 20*4*avg_edges loads and 20*1*avg_edges stores per node.
From Figure 4 below, the execution time of the kernel is 166 ms. We also expect num_elements active threads per kernel launch. With avg_edges = 8, the average number of bytes read per node is 20*avg_edges*(2*sizeof(float) + 2*sizeof(unsigned int)) = 20*8*(2*4 bytes + 2*4 bytes) = 2560 bytes, and the average number of bytes written per node is 20*avg_edges*sizeof(float) = 20*8*4 bytes = 640 bytes.
The effective bandwidth of the kernel can be estimated as:

Effective bandwidth = (avg bytes read + written per node) × n_nodes / execution time

Effective bandwidth = ((2560 + 640) bytes × 2^21) / 166 ms = 40.427 GB/s

The peak memory bandwidth of the RTX 2080Ti is 616 GB/s. Compared with this, the measured kernel bandwidth of 40.427 GB/s means the kernel is using only a fraction, roughly 6.6%, of the GPU's peak bandwidth.
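
For reference, the 616 GB/s figure follows from the RTX 2080Ti's memory interface, assuming the standard 14 Gb/s GDDR6 signalling rate on a 352-bit bus:

Peak bandwidth = (14 Gb/s per pin × 352 bits) / 8 bits per byte = 616 GB/s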

Figure 4: Result of Page Ranking in mp1-part3
