COMP 4007: Parallel Processing and Computer Architecture

Tutorial 4:
Hybrid Parallel Programming Models
TA: Hucheng Liu ([email protected])
Contents
• Part 1: MPI + OpenMP

• Part 2: MPI + CUDA


Part 1: MPI + OpenMP
MPI + OpenMP: Motivation
• Two-level parallelization
• Mimics hardware layout of cluster
• MPI between nodes or CPU sockets
• OpenMP within shared-memory nodes or processors

[Figure: each node runs one MPI process; OpenMP spawns threads T0–T3 inside each process, while MPI connects the processes across nodes]

• Pros
• No message passing inside of the shared-memory (SMP) nodes
• No topology problem

• Cons
• Should be careful with sleeping threads
• Not always better than pure MPI or OpenMP
MPI Rules with OpenMP
• Special MPI init for multi-threaded MPI processes:
int MPI_Init_thread( int* argc, char** argv[],
int thread_level_required,
int* thread_level_provided);
int MPI_Query_thread( int* thread_level_provided);
int MPI_Is_thread_main( int* flag);

• thread_level_required specifies the requested level of thread support.


• The actual level of support is then returned in thread_level_provided.
Four Options for Thread Support
• MPI_THREAD_SINGLE
• Only one thread will execute; equivalent to calling MPI_Init
• MPI_THREAD_FUNNELED
• Only the master thread will make MPI calls
• MPI_THREAD_SERIALIZED
• Multiple threads may make MPI calls, but only one at a time
• MPI_THREAD_MULTIPLE
• Multiple threads may call MPI with no restrictions
• In most cases MPI_THREAD_FUNNELED is the best choice for hybrid
programs; a minimal initialization sketch follows
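The snippet below is a minimal sketch of requesting MPI_THREAD_FUNNELED and checking the level that was actually provided (the error handling and variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;
    /* Request FUNNELED: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "Insufficient thread support provided by the MPI library\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... hybrid MPI + OpenMP work goes here ... */
    MPI_Finalize();
    return 0;
}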
Hybrid Hello
• mpi_omp_hello.c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen, iam, np;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each MPI process spawns an OpenMP team; every thread reports its IDs */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
Hybrid Array Sum: Funneled MPI calls
• mpi_omp_SumArray.c: Process 0
#pragma omp parallel
{
    if (pid == 0) {
        /* Funneled: only the master thread makes the MPI calls to distribute the chunks */
        #pragma omp master
        {
            for (int i = 1; i < np; i++) {
                MPI_Send(&elements_per_process, …);
                MPI_Send(&a[i * elements_per_process…);
            }
        }
        /* Wait for the sends, then all threads of process 0 sum its own chunk */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < elements_per_process; i++)
            local_sum += a[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: Other Processes
#pragma omp parallel
{
    /* Continuation of the same parallel region: the else branch taken by ranks != 0 */
    else {
        /* Funneled: only the master thread receives the chunk from rank 0 */
        #pragma omp master
        {
            MPI_Recv(&n_elements_recieved, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Recv(a2, n_elements_recieved, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        /* Wait for the receives, then sum the received chunk with all threads */
        #pragma omp barrier
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < n_elements_recieved; i++)
            local_sum += a2[i];
    }
}
Hybrid Array Sum
• mpi_omp_SumArray.c: All Processes
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
Environment Setup
• Set up passwordless SSH login between the nodes
• Refer to the Lab 3 slides
• Check that OpenMPI and OpenMP are installed, and install them if not


Compilation
• Use the OpenMPI compiler wrapper with the OpenMP -fopenmp switch
• mpic++ -fopenmp -o mpi_omp_hello mpi_omp_hello.c
• mpic++ -fopenmp -o mpi_omp_SumArray mpi_omp_SumArray.c
Execution
• Nearly the same as with pure MPI
• With the default number of threads in the OpenMP sections
• mpiexec -hostfile hostfile ./mpi_omp_hello
• Specify OMP_NUM_THREADS
• mpiexec -hostfile hostfile -x OMP_NUM_THREADS=3 ./mpi_omp_hello
• -x: Export an environment variable to the remote nodes before executing the
program, optionally specifying a value
• Specify OMP_NUM_THREADS for different hosts
• mpiexec -n 1 --host csl2wk01 -x OMP_NUM_THREADS=3 ./mpi_omp_hello : -n 2 --host csl2wk02:2 -x OMP_NUM_THREADS=2 ./mpi_omp_hello
Practice
• Implement vector addition using MPI and OpenMP (a minimal sketch of one possible approach follows this list)
• Sample code: ./practice/mpi_openmp/vector_addition.c
• Solution: ./practice/mpi_openmp/vector_addition_solution.c
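A minimal sketch of one possible approach (this is not the provided solution; the vector length, variable names, and data type are illustrative, and N is assumed to be divisible by the number of processes). MPI_Scatter distributes the chunks across processes, an OpenMP parallel for adds each chunk with all threads of a process, and MPI_Gather collects the partial results on the root:

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N 1000000   /* illustrative vector length */

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;
    double *a = NULL, *b = NULL, *c = NULL;
    if (rank == 0) {               /* root initializes the full vectors */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    /* MPI distributes the chunks; OpenMP adds each chunk in parallel */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    #pragma omp parallel for
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}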
Part 2: MPI + CUDA
Hybrid CUDA and MPI: Motivation
• MPI makes it easy to exchange data located on different processors
• CPU <-> CPU: traditional MPI
• GPU <-> GPU: CUDA-aware MPI (a small sketch follows this list)
• MPI + CUDA makes the application run more efficiently
• All operations that are required to carry out the message transfer can be pipelined
• Acceleration technologies like GPUDirect can be utilized by the MPI library transparently to the user
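For illustration, a minimal fragment assuming a CUDA-aware MPI build, that MPI has already been initialized, and that rank 0 is sending to rank 1 (the buffer name and size are illustrative):

const int N = 1024;                                     /* illustrative message size */
float *d_buf;                                           /* device buffer */
cudaMalloc((void **)&d_buf, N * sizeof(float));
/* With a CUDA-aware MPI library, the device pointer is passed to MPI directly */
MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    /* rank 0 -> rank 1 */
/* Without CUDA-aware MPI, the data would first be copied to a host buffer with cudaMemcpy */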
Unified Virtual Addressing (UVA)
• Without UVA: separate address spaces for host and device memory
• With UVA: one address space for all CPU and GPU memory
• The physical memory location can be determined from a pointer value (see the fragment below)
• Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
• Supported on devices with compute capability 2.0 or higher
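For example, under UVA the CUDA runtime can report where a pointer's memory lives. A small fragment using cudaPointerGetAttributes (buffer sizes are illustrative, and the attr.type field assumes a recent toolkit such as CUDA 11):

float *h_buf, *d_buf;
cudaMallocHost((void **)&h_buf, 1024 * sizeof(float));   /* pinned host memory */
cudaMalloc((void **)&d_buf, 1024 * sizeof(float));       /* device memory */

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, d_buf);
/* attr.type is cudaMemoryTypeDevice for d_buf and cudaMemoryTypeHost for h_buf;
   libraries such as CUDA-aware MPI use this information to decide how to move data */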
UVA Data Exchange with MPI
Example: Matrix Multiplication
• The root process generates two random matrices of the input size and
stores them in 1-D arrays in row-major order.
• The first matrix (Matrix A) is divided among the processes according to
the number of processes, and each part is sent to a separate GPU
(MPI_Scatter).
• The second matrix (Matrix B) is broadcast to all nodes and copied to
every GPU to perform the computation (MPI_Bcast).
• Each GPU computes its own part of the result matrix and sends the
result back to the root process.
• The results are gathered into the result matrix (MPI_Gather).
Code
• Without UVA: send the data from host memory
• matvec.cu

• With UVA: send the data from device memory
• matvec_uva.cu
matvec.cu (without UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector,
RowsNo, ColsNo, RowsNo2, ColsNo2);

• 2. Send data to different processes in host memory:


• MPI_Bcast(MatrixB, matrixBsize, MPI_FLOAT, 0, MPI_COMM_WORLD);
• MPI_Scatter(MatrixA, ScatterSize * ColsNo, MPI_FLOAT, MyMatrixA,
ScatterSize * ColsNo, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec.cu (without UVA)
• 3. Allocate device memory in each process:
• cudaMalloc( (void **)&DeviceMyMatrixA, ScatterSize * ColsNo * sizeof(float) );
• cudaMalloc( (void **)&DeviceMatrixB, matrixBsize * sizeof(float) );
• cudaMalloc( (void **)&DeviceMyResultVector, elements * sizeof(float) );

• 4. Copy the Data from host to device in each process:


• cudaMemcpy( (void *)DeviceMyMatrixA, (void *)MyMatrixA, ScatterSize * ColsNo * sizeof(float),
cudaMemcpyHostToDevice );
• cudaMemcpy( (void *)DeviceMatrixB, (void *)MatrixB, matrixBsize*sizeof(float),
cudaMemcpyHostToDevice );

• 5. Do the calculation in each process:


• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB,
DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo, ScatterSize, BLOCKSIZE,
MyRank, NumberOfProcessors);
matvec.cu (without UVA)
• 6. Copy the result from device to host in each process:
• cudaMemcpy( (void *)MyResultMatrix, (void *)DeviceMyResultVector,
elements * sizeof(float), cudaMemcpyDeviceToHost );

• 7. Gather the result:


• MPI_Gather(MyResultMatrix,elements, MPI_FLOAT, ResultVector,
elements, MPI_FLOAT, 0, MPI_COMM_WORLD);
matvec_uva.cu (with UVA)
• 1. Generate the data in the master process:
• Status = IntializingMatrixVectors(&MatrixA, &MatrixB, &ResultVector,
RowsNo, ColsNo, RowsNo2, ColsNo2);

• 2. Allocate the memory on the device memory in the master process:


• cudaMalloc( (void **)&DeviceRootMatrixA, RowsNo * ColsNo *
sizeof(float) ) ;
• cudaMalloc( (void **)&DeviceRootResultVector, RowsNo * ColsNo2 *
sizeof(float) ) ;

• 3. Copy the data from host to device in the master process:


• cudaMemcpy( (void *)DeviceRootMatrixA, (void *)MatrixA, RowsNo *
ColsNo * sizeof(float), cudaMemcpyHostToDevice );
matvec_uva.cu (with UVA)

• 4. Allocate device memory in each process:


• cudaMalloc( (void **)&DeviceMyMatrixA, ScatterSize * ColsNo *
sizeof(float) ) ;
• cudaMalloc( (void **)&DeviceMatrixB, matrixBsize * sizeof(float) );
• cudaMalloc( (void **)&DeviceMyResultVector, elements * sizeof(float) ) ;

• 5. Send data to different processes in device memory:


• MPI_Bcast(DeviceMatrixB, matrixBsize, MPI_FLOAT, 0,
MPI_COMM_WORLD);
• MPI_Scatter(DeviceRootMatrixA, ScatterSize * ColsNo, MPI_FLOAT,
DeviceMyMatrixA, ScatterSize * ColsNo, MPI_FLOAT, 0,
MPI_COMM_WORLD);
matvec_uva.cu (with UVA)
• 6. Do the calculation in each process:
• MatrixVectorMultiplication<<<1, 256>>>(DeviceMyMatrixA, DeviceMatrixB,
DeviceMyResultVector, RowsNo, ColsNo, RowsNo2, ColsNo2, ColsNo,
ScatterSize, BLOCKSIZE, MyRank, NumberOfProcessors);

• 7. Gather the result in the device memory in the master process:


• MPI_Gather(DeviceMyResultVector, elements, MPI_FLOAT,
DeviceRootResultVector, elements, MPI_FLOAT, 0, MPI_COMM_WORLD);

• 8. Copy the result from device to host in the master process:


• cudaMemcpy( (void *)ResultVector, (void *)DeviceRootResultVector,
RowsNo * ColsNo2 * sizeof(float), cudaMemcpyDeviceToHost );
Environment Setup
• CUDA 11 and OpenMPI 3.0
• setenv PATH "${PATH}:/usr/local/cuda-11/bin/"
Compilation
• 1. Put both MPI and CUDA code in a single file, matvec.cu.
• This program can be compiled using nvcc, which internally uses gcc/g++ to
compile the C/C++ code, and linked against the MPI library:
• /usr/local/cuda/bin/nvcc -Xcompiler -g -w -I.. -I /usr/local/software/openmpi/include/ -L /usr/local/software/openmpi/lib -lmpi matvec.cu -o newfloatmatvec
Compilation
• 2. Keep the MPI and CUDA code in two separate files, e.g. main.c and
multiply.cu. Compile them with mpicc and nvcc respectively into object
files (.o), then combine them into a single executable using mpicc
(example commands below).
• 3. The third option is the reverse of the first one: compile with mpicc,
which means you have to link against the CUDA runtime library yourself.
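For example, option 2 might look as follows (the installation paths and the -lcudart link flag are assumptions that depend on the local CUDA/OpenMPI setup; the file names follow the example above):
• nvcc -c multiply.cu -o multiply.o
• mpicc -c main.c -o main.o
• mpicc main.o multiply.o -L/usr/local/cuda/lib64 -lcudart -o program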
Execution
• Use mpiexec. If compiled with nvcc, include the OpenMPI lib path in
LD_LIBRARY_PATH (if OpenMPI is not installed in the default path)
• mpiexec --host csl2wk26:1,csl2wk25:1 -x
LD_LIBRARY_PATH=/usr/local/software/openmpi/lib:$LD_LIBRARY_PATH
./newfloatmatvec 4 3 3 4 -p -v
Practice
• Implement vector addition using MPI and CUDA (a minimal non-UVA sketch follows this list)
• Without UVA
• Sample code: ./practice/mpi_cuda/vector_addition.cu
• Solution: ./practice/mpi_cuda/vector_addition_solution.cu
• With UVA
• Sample code: ./practice/mpi_cuda/vector_addition_uva.cu
• Solution: ./practice/mpi_cuda/vector_addition_uva.cu
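A minimal sketch of the non-UVA pattern (this is not the provided solution; the kernel, names, and sizes are illustrative, and N is assumed to be divisible by the number of processes). The data is exchanged through host buffers with MPI and copied to and from the GPU explicitly:

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1 << 20;            /* illustrative total vector length */
    int chunk = N / nprocs;

    float *A = NULL, *B = NULL, *C = NULL;
    if (rank == 0) {                  /* root owns the full host vectors */
        A = new float[N]; B = new float[N]; C = new float[N];
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0f * i; }
    }
    float *hA = new float[chunk], *hB = new float[chunk], *hC = new float[chunk];

    /* Without UVA / CUDA-aware MPI: communicate host buffers only */
    MPI_Scatter(A, chunk, MPI_FLOAT, hA, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, chunk, MPI_FLOAT, hB, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, chunk * sizeof(float));
    cudaMalloc(&dB, chunk * sizeof(float));
    cudaMalloc(&dC, chunk * sizeof(float));
    cudaMemcpy(dA, hA, chunk * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, chunk * sizeof(float), cudaMemcpyHostToDevice);

    vecAdd<<<(chunk + 255) / 256, 256>>>(dA, dB, dC, chunk);
    cudaMemcpy(hC, dC, chunk * sizeof(float), cudaMemcpyDeviceToHost);

    /* Gather the partial results back to the root process */
    MPI_Gather(hC, chunk, MPI_FLOAT, C, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}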
Reference commands: run_lab4.sh
ompi_info | grep -i thread   (check the thread-support level of the OpenMPI build)

https://www.open-mpi.org/faq/?category=runcuda
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value   (check whether OpenMPI was built with CUDA support)
