COMP 4007: Parallel Processing and Computer Architecture

Tutorial 4:
Hybrid Parallel Programming Models
TA: Hucheng Liu ([email protected])
Contents
• Part 1: MPI + OpenMP

• Part 2: MPI + CUDA


Part 1: MPI + OpenMP
MPI + OpenMP: Motivation
• Two-level parallelization
• Mimics hardware layout of cluster
• MPI between nodes or CPU sockets
• OpenMP within shared-memory nodes or processors

[Figure: each node runs one MPI process; OpenMP spawns threads T0–T3 inside each process, while MPI connects the processes across nodes]

• Pros
• No message passing inside of the shared-memory (SMP) nodes
• No topology problem

• Cons
• Should be careful with sleeping threads
• Not always better than pure MPI or OpenMP
MPI Rules with OpenMP
• Special MPI init for multi-threaded MPI processes:
int MPI_Init_thread( int* argc, char** argv[],
int thread_level_required,
int* thread_level_provided);
int MPI_Query_thread( int* thread_level_provided);
int MPI_Is_thread_main( int* flag);

• thread_level_required specifies the requested level of thread support.


• The actual level of support is then returned in thread_level_provided.
Four Options for Thread Support
• MPI_THREAD_SINGLE
• Only one thread will execute; equivalent to calling MPI_Init
• MPI_THREAD_FUNNELED
• Only the master thread will make MPI calls
• MPI_THREAD_SERIALIZED
• Multiple threads may make MPI calls, but only one at a time
• MPI_THREAD_MULTIPLE
• Multiple threads may call MPI with no restrictions
• In most cases MPI_THREAD_FUNNELED is the best choice for hybrid
programs; a minimal initialization sketch follows
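The snippet below is a minimal sketch of requesting MPI_THREAD_FUNNELED and checking the level that was actually provided (the error handling and variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;
    /* Request FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "Insufficient thread support provided by the MPI library\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... hybrid MPI + OpenMP work goes here ... */
    MPI_Finalize();
    return 0;
}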
Hybrid Hello
• mpi_omp_hello.c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen, iam, np;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each MPI process spawns an OpenMP team; every thread reports its IDs */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
Hybrid Array Sum: Funneled MPI calls
• mpi_omp_SumArray.c: Process 0
#pragma omp parallel
{
    if (pid == 0) {
        /* Funneled: only the master thread makes the MPI calls to distribute the chunks */
        #pragma omp master
        {
            for (int i = 1; i < np; i++) {
                MPI_Send(&elements_per_process, …);
                MPI_Send(&a[i * elements_per_process…);
            }
        }
        /* Wait for the sends, then all threads of process 0 sum its own chunk */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < elements_per_process; i++)
            local_sum += a[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: Other Processes
#pragma omp parallel
{
    /* Continuation of the same parallel region: the else branch taken by ranks != 0 */
    else {
        /* Funneled: only the master thread receives the chunk from rank 0 */
        #pragma omp master
        {
            MPI_Recv(&n_elements_recieved, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(a2, n_elements_recieved, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        /* Wait for the receives, then sum the received chunk with all threads */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < n_elements_recieved; i++)
            local_sum += a2[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: All Processes
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
Environment Setup
• Set up passwordless SSH login between the nodes
• Refer to the Lab 3 slides
• Check that OpenMPI and OpenMP are installed, and install them if not


Compilation
• Use the OpenMPI compiler wrapper with the OpenMP -fopenmp switch
• mpic++ -fopenmp -o mpi_omp_hello mpi_omp_hello.c
• mpic++ -fopenmp -o mpi_omp_SumArray mpi_omp_SumArray.c
Execution
• Nearly the same as with pure MPI
• With the default number of threads in the OpenMP sections
• mpiexec -hostfile hostfile ./mpi_omp_hello
• Specify OMP_NUM_THREADS
• mpiexec -hostfile hostfile -x OMP_NUM_THREADS=3 ./mpi_omp_hello
• -x: Export an environment variable to the remote nodes before executing the
program, optionally specifying a value
• Specify OMP_NUM_THREADS for different hosts
• mpiexec -n 1 --host csl2wk01 -x OMP_NUM_THREADS=3 ./mpi_omp_hello : -n 2 --host csl2wk02:2 -x OMP_NUM_THREADS=2 ./mpi_omp_hello
Practice
• Implement vector addition using MPI and OpenMP (a minimal sketch of one possible approach follows this list)
• Sample code: ./practice/mpi_openmp/vector_addition.c
• Solution: ./practice/mpi_openmp/vector_addition_solution.c
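A minimal sketch of one possible approach (this is not the provided solution; the vector length, variable names, and data type are illustrative, and N is assumed to be divisible by the number of processes). MPI_Scatter distributes the chunks across processes, an OpenMP parallel for adds each chunk with all threads of a process, and MPI_Gather collects the partial results on the root:

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N 1000000   /* illustrative vector length */

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;
    double *a = NULL, *b = NULL, *c = NULL;
    if (rank == 0) {               /* root initializes the full vectors */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    /* MPI distributes the chunks; OpenMP adds each chunk in parallel */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    #pragma omp parallel for
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}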
Part 2: MPI + CUDA
Hybrid CUDA and MPI: Motivation
• MPI makes it easy to exchange data located on different processors
• CPU <-> CPU: traditional MPI
• GPU <-> GPU: CUDA-aware MPI (a small sketch follows this list)
• MPI + CUDA makes the application run more efficiently
• All operations that are required to carry out the message transfer can be pipelined
• Acceleration technologies like GPUDirect can be utilized by the MPI library transparently to the user
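For illustration, a minimal fragment assuming a CUDA-aware MPI build, that MPI has already been initialized, and that rank 0 is sending to rank 1 (the buffer name and size are illustrative):

const int N = 1024;                                     /* illustrative message size */
float *d_buf;                                           /* device buffer */
cudaMalloc((void **)&d_buf, N * sizeof(float));
/* With a CUDA-aware MPI library, the device pointer is passed to MPI directly */
MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    /* rank 0 -> rank 1 */
/* Without CUDA-aware MPI, the data would first be copied to a host buffer with cudaMemcpy */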
Unified Virtual Addressing (UVA)
• Without UVA: separate address spaces for host and device memory
• With UVA: one address space for all CPU and GPU memory
• The physical memory location can be determined from a pointer value (see the fragment below)
• Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
• Supported on devices with compute capability 2.0 or higher
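For example, under UVA the CUDA runtime can report where a pointer's memory lives. A small fragment using cudaPointerGetAttributes (buffer sizes are illustrative, and the attr.type field assumes a recent toolkit such as CUDA 11):

float *h_buf, *d_buf;
cudaMallocHost((void **)&h_buf, 1024 * sizeof(float));   /* pinned host memory */
cudaMalloc((void **)&d_buf, 1024 * sizeof(float));       /* device memory */

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, d_buf);
/* attr.type is cudaMemoryTypeDevice for d_buf and cudaMemoryTypeHost for h_buf;
   libraries such as CUDA-aware MPI use this information to decide how to move data */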
UVA Data Exchange with MPI
Example: Matrix Multiplication
• The root process generates two random matrices of the input size and
stores them in 1-D arrays in row-major order.
• The first matrix (Matrix A) is divided among the processes according to
the number of processes, and each part is sent to a separate GPU
(MPI_Scatter).
• The second matrix (Matrix B) is broadcast to all nodes and copied to
every GPU to perform the computation (MPI_Bcast).
• Each GPU computes its own part of the result matrix and sends the
result back to the root process.
• The results are gathered into the result matrix (MPI_Gather).
Code
• Without UVA: send the data from host memory
• matvec.cu

• With UVA: send the data from device memory
• matvec_uva.cu
matvec.cu (without UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector,
RowsNo, ColsNo, RowsNo2, ColsNo2);

• 2. Send data to different processes in host memory:


• MPI_Bcast(MatrixB, matrixBsize, MPI_FLOAT, 0, MPI_COMM_WORLD);
• MPI_Scatter(MatrixA, ScatterSize * ColsNo, MPI_FLOAT, MyMatrixA,
ScatterSize * ColsNo, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec.cu (without UVA)
• 3. Allocate device memory in each process:
• cudaMalloc( (void **)&DeviceMyMatrixA, ScatterSize * ColsNo * sizeof(float) );
• cudaMalloc( (void **)&DeviceMatrixB, matrixBsize * sizeof(float) );
• cudaMalloc( (void **)&DeviceMyResultVector, elements * sizeof(float) );

• 4. Copy the Data from host to device in each process:


• cudaMemcpy( (void *)DeviceMyMatrixA, (void *)MyMatrixA, ScatterSize * ColsNo * sizeof(float),
cudaMemcpyHostToDevice );
• cudaMemcpy( (void *)DeviceMatrixB, (void *)MatrixB, matrixBsize*sizeof(float),
cudaMemcpyHostToDevice );

• 5. Do the calculation in each process:


• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB,
DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo, ScatterSize, BLOCKSIZE,
MyRank, NumberOfProcessors);
matvec.cu (without UVA)
• 6. Copy the result from device to host in each process:
• cudaMemcpy( (void *)MyResultMatrix, (void *)DeviceMyResultVector,
elements * sizeof(float), cudaMemcpyDeviceToHost );

• 7. Gather the result:


• MPI_Gather(MyResultMatrix,elements, MPI_FLOAT, ResultVector,
elements, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec_uva.cu (with UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector,
RowsNo, ColsNo, RowsNo2, ColsNo2);

• 2. Allocate the memory on the device memory in the master process:


• cudaMalloc( (void **)&DeviceRootMatrixA, RowsNo * ColsNo *
sizeof(float) ) ;
• cudaMalloc( (void **)&DeviceRootResultVector, RowsNo * ColsNo2 *
sizeof(float) ) ;

• 3. Copy the data from host to device in the master process:


• cudaMemcpy( (void *)DeviceRootMatrixA, (void *)MatrixA, RowsNo *
ColsNo * sizeof(float), cudaMemcpyHostToDevice );
matvec_uva.cu (with UVA)

• 4. Allocate device memory in each process:


• cudaMalloc( (void **)&DeviceMyMatrixA, ScatterSize * ColsNo *
sizeof(float) ) ;
• cudaMalloc( (void **)&DeviceMatrixB, matrixBsize * sizeof(float) );
• cudaMalloc( (void **)&DeviceMyResultVector, elements * sizeof(float) ) ;

• 5. Send data to different processes in device memory:


• MPI_Bcast(DeviceMatrixB, matrixBsize, MPI_FLOAT, 0,
MPI_COMM_WORLD);
• MPI_Scatter(DeviceRootMatrixA, ScatterSize * ColsNo, MPI_FLOAT,
DeviceMyMatrixA, ScatterSize * ColsNo, MPI_FLOAT, 0,
MPI_COMM_WORLD);
matvec_uva.cu (with UVA)
• 6. Do the calculation in each process:
• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB,
DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo,
ScatterSize, BLOCKSIZE, MyRank, NumberOfProcessors);

• 7. Gather the result in the device memory in the master process:


• MPI_Gather(DeviceMyResultVector, elements, MPI_FLOAT,
DeviceRootResultVector, elements, MPI_FLOAT, 0, MPI_COMM_WORLD);

• 8. Copy the result from device to host in the master process:


• cudaMemcpy( (void *)ResultVector, (void *)DeviceRootResultVector,
RowsNo * ColsNo2 * sizeof(float), cudaMemcpyDeviceToHost );
Environment Setup
• CUDA 11 and OpenMPI 3.0
• setenv PATH "${PATH}:/usr/local/cuda-11/bin/"
Compilation
• 1. Put both MPI and CUDA code in a single file, matvec.cu.
• This program can be compiled using nvcc, which internally uses gcc/g++ to
compile the C/C++ code, and linked against the MPI library:
• /usr/local/cuda/bin/nvcc -Xcompiler -g -w -I.. -I /usr/local/software/openmpi/include/ -L /usr/local/software/openmpi/lib -lmpi matvec.cu -o newfloatmatvec
Compilation
• 2. Keep the MPI and CUDA code in two separate files, e.g. main.c and
multiply.cu. Compile them with mpicc and nvcc respectively into object
files (.o), then combine them into a single executable using mpicc
(example commands below).
• 3. The third option is the reverse of the first one: compile with mpicc,
which means you have to link against the CUDA runtime library yourself.
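For example, option 2 might look as follows (the installation paths and the -lcudart link flag are assumptions that depend on the local CUDA/OpenMPI setup; the file names follow the example above):
• nvcc -c multiply.cu -o multiply.o
• mpicc -c main.c -o main.o
• mpicc main.o multiply.o -L/usr/local/cuda/lib64 -lcudart -o program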
Execution
• Use mpiexec. If compiled with nvcc, include the OpenMPI lib path in
LD_LIBRARY_PATH (if OpenMPI is not installed in the default path)
• mpiexec --host csl2wk26:1,csl2wk25:1 -x
LD_LIBRARY_PATH=/usr/local/software/openmpi/lib:$LD_LIBRARY_PATH
./newfloatmatvec 4 3 3 4 -p -v
Practice
• Implement vector addition using MPI and CUDA (a minimal non-UVA sketch follows this list)
• Without UVA
• Sample code: ./practice/mpi_cuda/vector_addition.cu
• Solution: ./practice/mpi_cuda/vector_addition_solution.cu
• With UVA
• Sample code: ./practice/mpi_cuda/vector_addition_uva.cu
• Solution: ./practice/mpi_cuda/vector_addition_uva.cu
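A minimal sketch of the non-UVA pattern (this is not the provided solution; the kernel, names, and sizes are illustrative, and N is assumed to be divisible by the number of processes). The data is exchanged through host buffers with MPI and copied to and from the GPU explicitly:

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1 << 20;            /* illustrative total vector length */
    int chunk = N / nprocs;

    float *A = NULL, *B = NULL, *C = NULL;
    if (rank == 0) {                  /* root owns the full host vectors */
        A = new float[N]; B = new float[N]; C = new float[N];
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0f * i; }
    }
    float *hA = new float[chunk], *hB = new float[chunk], *hC = new float[chunk];

    /* Without UVA / CUDA-aware MPI: communicate host buffers only */
    MPI_Scatter(A, chunk, MPI_FLOAT, hA, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, chunk, MPI_FLOAT, hB, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, chunk * sizeof(float));
    cudaMalloc(&dB, chunk * sizeof(float));
    cudaMalloc(&dC, chunk * sizeof(float));
    cudaMemcpy(dA, hA, chunk * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, chunk * sizeof(float), cudaMemcpyHostToDevice);

    vecAdd<<<(chunk + 255) / 256, 256>>>(dA, dB, dC, chunk);
    cudaMemcpy(hC, dC, chunk * sizeof(float), cudaMemcpyDeviceToHost);

    /* Gather the partial results back to the root process */
    MPI_Gather(hC, chunk, MPI_FLOAT, C, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}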
Reference commands: run_lab4.sh
ompi_info | grep -i thread   (check the thread-support level of the OpenMPI build)

https://www.open-mpi.org/faq/?category=runcuda
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value   (check whether OpenMPI was built with CUDA support)
