Tutorial 4: Hybrid Parallel Programming Models
TA: Hucheng Liu ([email protected])
Contents
• Part 1: MPI + OpenMP
• Part 2: MPI + CUDA
Unified Virtual Addressing (UVA)
• UVA: one address space for all CPU and GPU memory
• The physical location of memory can be determined from the pointer value alone (see the query sketch below)
• Enables libraries to simplify their interfaces (e.g., MPI and cudaMemcpy)
• Supported on devices with compute capability 2.0 or higher (in 64-bit applications)
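For example, the runtime can report where a pointer's memory physically lives. A minimal sketch, assuming CUDA 10 or later, where struct cudaPointerAttributes exposes a `type` field (older toolkits named it `memoryType`):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    float *d_buf;
    cudaMalloc((void **)&d_buf, 256 * sizeof(float));

    /* With UVA, the runtime can classify a pointer by value alone. */
    struct cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);
    printf("d_buf points to %s memory\n",
           attr.type == cudaMemoryTypeDevice ? "device" : "host/other");

    cudaFree(d_buf);
    return 0;
}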
UVA Data Exchange with MPI
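The practical consequence for MPI: with UVA and a CUDA-aware MPI library (such as a CUDA-enabled Open MPI build), device pointers can be passed straight to MPI calls, and the library performs the device-host transfers internally. A minimal sketch under that assumption (buffer size and tag are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;
    double *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(double));

    if (rank == 0) {
        /* Device pointer passed directly to MPI: no manual cudaMemcpy.
           Requires a CUDA-aware MPI build; plain MPI would fail here. */
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}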
Example: Matrix Multiplication
• The root process generates two random matrices of the requested size and
stores each in a 1-D array in row-major order.
• The first matrix (Matrix A) is divided into equal blocks of rows, one per
MPI process, and each block is sent to a separate GPU. (MPI_Scatter)
• The second matrix (Matrix B) is broadcast to all nodes and copied to
every GPU to perform the computation. (MPI_Bcast)
• Each GPU computes its own block of the result matrix and sends the
result back to the root process.
• The root gathers the partial results into the final matrix, as sketched
after this list. (MPI_Gather)
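A condensed sketch of these four steps, assuming square N×N matrices with N divisible by the number of processes, at least one GPU per node, and a naive kernel (error checking omitted; the tutorial's own code will differ):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N 1024  /* matrix dimension; assumed divisible by nprocs */

__global__ void matmul(const double *A, const double *B, double *C,
                       int rows) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* local row */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
    if (i < rows && j < N) {
        double sum = 0.0;
        for (int k = 0; k < N; ++k)
            sum += A[i * N + k] * B[k * N + j];
        C[i * N + j] = sum;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ndev = 1;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);            /* one GPU per rank */

    int rows = N / nprocs;                 /* rows of A (and C) per rank */
    double *A = NULL, *C = NULL;
    double *Asub = (double *)malloc(rows * N * sizeof(double));
    double *B    = (double *)malloc(N * N * sizeof(double));
    double *Csub = (double *)malloc(rows * N * sizeof(double));

    if (rank == 0) {                       /* root generates random inputs */
        A = (double *)malloc(N * N * sizeof(double));
        C = (double *)malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; ++i) {
            A[i] = rand() / (double)RAND_MAX;
            B[i] = rand() / (double)RAND_MAX;
        }
    }

    /* Step 1: scatter row blocks of A. Step 2: broadcast all of B. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Asub, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Step 3: each rank multiplies its block on its own GPU. */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, rows * N * sizeof(double));
    cudaMalloc((void **)&dB, N * N * sizeof(double));
    cudaMalloc((void **)&dC, rows * N * sizeof(double));
    cudaMemcpy(dA, Asub, rows * N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, N * N * sizeof(double), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    matmul<<<grid, block>>>(dA, dB, dC, rows);
    cudaMemcpy(Csub, dC, rows * N * sizeof(double), cudaMemcpyDeviceToHost);

    /* Step 4: gather the partial results on the root. */
    MPI_Gather(Csub, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(Asub); free(B); free(Csub);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}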
Code
• Without UVA, the data must be staged through host memory: copy the device
buffer to the host before MPI_Send, and copy from the host buffer back to
the device after MPI_Recv (see the sketch below).
• matvec.cu
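A minimal sketch of this staging pattern, assuming rank 0 sends an n-element device buffer to rank 1 (function and variable names are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void exchange_without_uva(double *d_buf, int n, int rank) {
    /* Staging buffer in host memory; without a CUDA-aware MPI,
       MPI calls only understand host pointers. */
    double *h_buf = (double *)malloc(n * sizeof(double));

    if (rank == 0) {
        /* Device -> host, then send from the host buffer. */
        cudaMemcpy(h_buf, d_buf, n * sizeof(double),
                   cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive into the host buffer, then host -> device. */
        MPI_Recv(h_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, n * sizeof(double),
                   cudaMemcpyHostToDevice);
    }
    free(h_buf);
}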
• Open MPI FAQ on running CUDA-aware jobs:
https://www.open-mpi.org/faq/?category=runcuda
• To check whether an Open MPI installation was built with CUDA support:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
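Open MPI can also be queried programmatically through its extensions header; a minimal sketch, assuming a reasonably recent Open MPI whose mpi-ext.h provides MPIX_Query_cuda_support:

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    /* Compile-time flag was set; also query at runtime, since the
       library actually loaded may differ from the build headers. */
    printf("CUDA-aware support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("This Open MPI build has no CUDA-aware support\n");
#endif
    MPI_Finalize();
    return 0;
}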