A Tutorial On Parallel Computing On Shared Memory Systems

This document provides an overview of parallel computing on shared memory systems using OpenMP. It discusses what OpenMP is, its history, and its components. OpenMP uses a fork-join model of parallelism in which the master thread creates a team of threads to execute parallel regions. The document describes OpenMP directives, runtime routines, environment variables, and how OpenMP handles data scoping. It also discusses work sharing constructs such as parallel loops and sections. Features like scheduling and critical sections, and potential problems with parallelization, are covered at a high level.

A tutorial on parallel computing on shared memory systems

OpenMP, compiler-level parallelization and Fortran 2008 co-arrays
What OpenMP is (source: https://computing.llnl.gov/tutorials/openMP/)
OpenMP Is:              
•An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.
•Comprised of three primary API components:
•Compiler Directives
•Runtime Library Routines
•Environment Variables

OpenMP Is Not:

•Meant for distributed memory parallel systems (by itself)


•Necessarily implemented identically by all vendors
•Guaranteed to make the most efficient use of shared memory
•Required to check for data dependencies, data conflicts, race conditions, deadlocks, or code sequences that cause a program to be
classified as non-conforming
•Designed to handle parallel I/O. The programmer is responsible for synchronizing input and output.

History:

•In the early 1990s, vendors of shared-memory machines supplied similar, directive-based Fortran programming extensions.
•The user would augment a serial Fortran program with directives specifying which loops were to be parallelized.
•The compiler would be responsible for automatically parallelizing such loops across the shared-memory processors. Implementations were all functionally similar but were diverging.
•The first attempt at a standard was ANSI X3H5, in 1994. It was never adopted, largely because distributed memory machines were becoming popular.
•However, not long after this, newer shared memory machine architectures started to become prevalent, and interest resumed.
•The OpenMP standard specification started in the spring of 1997, taking over where ANSI X3H5 had left off.
Shared Memory Model
• OpenMP is designed for multi-processor/core, shared memory
machines. The underlying architecture can be shared memory Uniform
Memory Access or Non-uniform Memory Access.

(Diagrams: Uniform Memory Access (UMA) and Non-uniform Memory Access (NUMA) architectures)


• The idealized use of OpenMP in HPC on modern clusters is to increase node-level efficiency; codes that combine OpenMP within nodes with MPI across nodes are called hybrid codes.
Hybrid OpenMP - MPI
• OpenMP parallelization within each node
• MPI parallelization between nodes

OpenMP parallelization model
• OpenMP uses a fork-join model of thread-level parallelism.

• All OpenMP programs begin with a single master thread.


• Given a suitable OpenMP directive to create a parallel region, the master thread then
creates a team of parallel threads through a FORK.
• The statements in the program within the parallel region construct are executed in
parallel among the threads.
• JOIN: When the team threads complete the statements in the parallel region construct,
they synchronize and terminate, leaving only the master thread.
• The number of parallel regions and the threads that comprise them are arbitrary.
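A minimal sketch of this fork-join structure (not from the slides; the program name and messages are illustrative):

      PROGRAM FORK_JOIN_DEMO
      USE OMP_LIB
      IMPLICIT NONE
!     Serial part: only the master thread runs here
      PRINT *, 'Before the parallel region: one thread'
!     FORK: the master thread creates a team of threads
!$OMP PARALLEL
      PRINT *, 'Hello from thread', OMP_GET_THREAD_NUM(), 'of', OMP_GET_NUM_THREADS()
!     JOIN: the team synchronizes and terminates at END PARALLEL
!$OMP END PARALLEL
      PRINT *, 'After the parallel region: only the master thread remains'
      END PROGRAM FORK_JOIN_DEMO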
Components of OpenMP
• The OpenMP API is comprised of three distinct components:
1) Compiler Directives
2) Runtime Library Routines
3) Environment Variables
• Compiler directives are for:
1) Spawning a parallel region
2) Dividing blocks of code among threads
3) Distributing loop iterations between threads
4) Serializing sections of code
5) Synchronization of work among threads
• Directive format: sentinel directive-name [clause, ...]

• Example: !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(BETA,PI)


Run-time Library Routines and
Environment variables
• Run-time library routines (provided by each vendor's OpenMP implementation), among other things, are used to:
1) Set the number of threads, and inquire how many threads are active
2) Query a thread's unique identifier (thread ID), a thread's ancestor's identifier, and the thread team size
3) Query whether execution is inside a parallel region, and at what nesting level
• Environment variables, among other things, are used to:
1) Set the number of threads
2) Specify how loop iterations are divided
3) Bind threads to processors
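A minimal Fortran sketch of these queries (the program itself is illustrative; the OMP_* routines and the OMP_NUM_THREADS variable are the standard OpenMP ones):

      PROGRAM QUERY_DEMO
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER :: TID, NTHREADS
!     Runtime routine: request a team size (overrides OMP_NUM_THREADS)
      CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL PRIVATE(TID, NTHREADS)
      TID      = OMP_GET_THREAD_NUM()    ! this thread's unique ID
      NTHREADS = OMP_GET_NUM_THREADS()   ! size of the current team
      IF (OMP_IN_PARALLEL()) THEN
         PRINT *, 'Thread', TID, 'of', NTHREADS, 'at nesting level', OMP_GET_LEVEL()
      END IF
!$OMP END PARALLEL
      END PROGRAM QUERY_DEMO

The same team size could instead be requested from the shell with the environment variable OMP_NUM_THREADS (e.g. export OMP_NUM_THREADS=4), and thread binding controlled with OMP_PROC_BIND / OMP_PLACES.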
OpenMP code structure and data scoping

      PROGRAM HELLO
      INTEGER VAR1, VAR2, VAR3
!     Serial code . . .
!     Beginning of parallel region: fork a team of threads and specify variable scoping
!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
!     Parallel region executed by all threads
!     Other OpenMP directives
!     Run-time library calls
!     All threads join the master thread and disband
!$OMP END PARALLEL
!     Resume serial code . . .
      END

Data scoping:
• Because OpenMP is a shared memory programming model, most data within a parallel region is shared by default.
• All threads in a parallel region can access this shared data simultaneously.
• OpenMP provides a way for the programmer to explicitly specify how data is "scoped" if the default shared scoping is not desired.
General rules for OpenMP directives
• Comments cannot appear on the same line as a directive
• Only one directive-name may be specified per directive
• Fortran compilers which are OpenMP enabled generally include a
command line option which instructs the compiler to activate and
interpret all OpenMP directives.
• Several Fortran OpenMP directives come in pairs and have the form
shown below. The "end" directive is optional but advised for readability.
!$OMP directive

[ structured block of code ]

!$OMP end directive


Features of OpenMP parallelism

      PROGRAM TEST
      ...
!$OMP PARALLEL
      ...
!$OMP DO
      DO I = ...
         ...
         CALL SUB1
         ...
      ENDDO
!$OMP END DO
      ...
      CALL SUB2
      ...
!$OMP END PARALLEL
      END

      SUBROUTINE SUB1
      ...
!$OMP CRITICAL
      ...
!$OMP END CRITICAL
      END

      SUBROUTINE SUB2
      ...
!$OMP SECTIONS
      ...
!$OMP END SECTIONS
      ...
      END

STATIC EXTENT: The DO directive occurs within an enclosing parallel region.
ORPHANED DIRECTIVES: The CRITICAL and SECTIONS directives occur outside an enclosing parallel region.
DYNAMIC EXTENT: The CRITICAL and SECTIONS directives occur within the dynamic extent of the DO and PARALLEL directives.
Example 2: Work sharing constructs
• DO / for - shares iterations of a loop across the team. Represents a type of "data parallelism".
• SECTIONS - breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".
• SINGLE - serializes a section of code.
Scheduling in OpenMP
• SCHEDULE: Describes how iterations of the loop are divided among the
threads in the team. The default schedule is implementation dependent.
The goal is to keep all the threads equally busy throughout the loop.
• STATIC: Loop iterations are divided into pieces of size chunk and then
statically assigned to threads. If chunk is not specified, the iterations are
evenly (if possible) divided contiguously among the threads.
• A static schedule is good if the work you have can be divided up pretty
evenly (e.g., so all the processors you are using are busy for about the
same amount of time). Most of the overhead work of static is done at
compile time.
Scheduling in OpenMP
• DYNAMIC: Loop iterations are divided into pieces of size chunk, and
dynamically scheduled among the threads; when a thread finishes one
chunk, it is dynamically assigned another. The default chunk size is 1.
• Dynamic is good if your workload is uneven, since one or more
processors might be busy for a long time. It allows the processors with
small workloads to go after other chunks of work and hopefully balance
the work out between processors. It has a larger overhead at runtime,
since work has to be taken off of a queue.
Scheduling in OpenMP
GUIDED: Iterations are dynamically assigned to threads in blocks as
threads request them until no blocks remain to be assigned. Similar to
DYNAMIC except that the block size decreases each time a parcel of work
is given to a thread.
The size of the initial block is proportional to: number_of_iterations / number_of_threads
Subsequent blocks are proportional to number_of_iterations_remaining / number_of_threads
The chunk parameter defines the minimum block size. The default chunk
size is 1.
Note: compilers differ in how GUIDED is implemented, as the "Guided A" and "Guided B" examples in the original slides show.
Like DYNAMIC, GUIDED is good when job sizes are uneven and the exact loads are not known in advance. It also incurs overhead at runtime.
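How the three schedules are requested in Fortran, sketched on simple loops (the loop bodies and chunk sizes are only placeholders):

      PROGRAM SCHEDULE_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      REAL :: WORK(N)
      INTEGER :: I

!     STATIC, chunk 10: fixed assignment decided up front, lowest overhead
!$OMP PARALLEL DO SCHEDULE(STATIC, 10)
      DO I = 1, N
         WORK(I) = REAL(I)
      END DO
!$OMP END PARALLEL DO

!     DYNAMIC, chunk 4: threads take the next chunk from a queue as they finish
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4)
      DO I = 1, N
         WORK(I) = WORK(I)**2          ! stands in for uneven work per iteration
      END DO
!$OMP END PARALLEL DO

!     GUIDED, minimum chunk 2: chunk sizes start large and shrink as the loop progresses
!$OMP PARALLEL DO SCHEDULE(GUIDED, 2)
      DO I = 1, N
         WORK(I) = SQRT(WORK(I))
      END DO
!$OMP END PARALLEL DO

      PRINT *, 'Checksum:', SUM(WORK)
      END PROGRAM SCHEDULE_DEMO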
Problems with OpenMP parallelizations
• In addition to the usual, "serial" bugs, parallel programs can have "parallel-only" bugs, such as:
• Race conditions - when results depend on a specific ordering of commands which is not enforced.
• Deadlocks - when tasks wait perpetually for a message or signal that never arrives.
OpenMP parallelization may lead to race
conditions
• Sometimes, order of instructions across threads creates problems
with shared variables.
• Imagine the sum loop:
!$OMP PARALLEL DO
      DO I = 1, N
        SUM = SUM + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO
Note that SUM is a shared variable. Multiple threads might want to
update its value at the same time. What then?
This is called a race condition.
OpenMP REDUCTION clause: one solution for race conditions
• Each thread works on a private copy of the reduction variable; the private copies are automatically combined at the end of the loop.

• Structure of reduced loops:

!$omp parallel shared(n) private(i,j,prime)
!$omp do reduction (+:sum)
…Code body…
!$omp end do
!$omp end parallel
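Applied to the dot-product loop from the race-condition slide, a minimal complete sketch might look like this (array sizes and values are illustrative):

      PROGRAM DOT_PRODUCT_REDUCTION
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100000
      REAL(KIND=8) :: A(N), B(N), SUM
      INTEGER :: I

      A = 1.0D0
      B = 2.0D0
      SUM = 0.0D0

!     Each thread accumulates into a private copy of SUM;
!     the private copies are combined with "+" at the end of the loop.
!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO I = 1, N
         SUM = SUM + A(I) * B(I)
      END DO
!$OMP END PARALLEL DO

      PRINT *, 'Dot product =', SUM
      END PROGRAM DOT_PRODUCT_REDUCTION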
Other solutions: Synchronization
constructs
• CRITICAL directive:
1) The CRITICAL directive specifies a region of code that must
be executed by only one thread at a time.
2) If a thread is currently executing inside a CRITICAL region
and another thread reaches that CRITICAL region and
attempts to execute it, it will block until the first thread exits
that CRITICAL region.
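A minimal sketch of the same accumulation protected with CRITICAL instead of REDUCTION (illustrative; each thread first builds a private partial sum so that only one update per thread hits the critical region):

      PROGRAM CRITICAL_SUM
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100000
      REAL(KIND=8) :: A(N), B(N), TOTAL, LOCAL_SUM
      INTEGER :: I

      A = 1.0D0
      B = 2.0D0
      TOTAL = 0.0D0

!$OMP PARALLEL PRIVATE(I, LOCAL_SUM) SHARED(A, B, TOTAL)
      LOCAL_SUM = 0.0D0
!$OMP DO
      DO I = 1, N
         LOCAL_SUM = LOCAL_SUM + A(I) * B(I)   ! no race: LOCAL_SUM is private
      END DO
!$OMP END DO
!     Only one thread at a time may update the shared TOTAL
!$OMP CRITICAL
      TOTAL = TOTAL + LOCAL_SUM
!$OMP END CRITICAL
!$OMP END PARALLEL

      PRINT *, 'Dot product =', TOTAL
      END PROGRAM CRITICAL_SUM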
SINGLE, MASTER, BARRIER: examples
Some practical examples
• A matrix multiplication code: naïve parallelization example.
• A more generic approach for solutions on shared and
distributed memory systems – domain decomposition:

Domain decomposition has many fancy definitions. To me, it is simply block operations on matrices – many practical problems in numerical computation can be reduced to this simple goal.
Pseudo-code for block matrix
multiplication
• The pseudo-code is sketched below (the code is not specific to any language; it just explains the logic).
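The slide's own pseudo-code is not reproduced in this extraction; the following Fortran sketch shows the intended logic under assumed names and sizes (square matrices of order N, block size BS dividing N), with the outer block loops parallelized:

      PROGRAM BLOCK_MATMUL
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 512, BS = 64      ! assumed sizes; BS divides N
      REAL(KIND=8) :: A(N,N), B(N,N), C(N,N)
      INTEGER :: II, JJ, KK, I, J, K

      A = 1.0D0
      B = 1.0D0
      C = 0.0D0

!     Loop over blocks of C; each (II,JJ) block is owned by one thread,
!     so the updates to C are independent and there is no race.
!$OMP PARALLEL DO COLLAPSE(2) PRIVATE(II, JJ, KK, I, J, K)
      DO JJ = 1, N, BS
         DO II = 1, N, BS
            DO KK = 1, N, BS
!              Multiply one pair of blocks: C(II.., JJ..) += A(II.., KK..) * B(KK.., JJ..)
               DO J = JJ, JJ + BS - 1
                  DO K = KK, KK + BS - 1
                     DO I = II, II + BS - 1
                        C(I, J) = C(I, J) + A(I, K) * B(K, J)
                     END DO
                  END DO
               END DO
            END DO
         END DO
      END DO
!$OMP END PARALLEL DO

      PRINT *, 'C(1,1) =', C(1,1)     ! equals N (here 512) for all-ones inputs
      END PROGRAM BLOCK_MATMUL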
Strengths?
1) Easily vectorizable (SIMD upscaling).
2) A single, non-data-dependent loop over blocks.
3) The only bottleneck is data transmission.
Domain-decomposition for parallel linear
solvers
• The class of methods that we have studied so far is called direct solvers – they are guaranteed to succeed in a fixed number of steps.
• However, special matrices which are either sparse or sparse-like (for example, extremely diagonally dominant) can be solved more efficiently by generic methods which solve the problem iteratively: we build a sequence of approximations x^(k) such that we ultimately have x^(k) → x with A x = b.
Two common iterative solvers
• Central idea: split A = P − N. Then iterate P x^(k+1) = N x^(k) + b, i.e. x^(k+1) = P^(-1) (N x^(k) + b), and use the appropriate P and N for Jacobi or Gauss-Seidel.

What are P and N?
• Jacobi: P = D, the diagonal of A, and N = D − A (minus the off-diagonal part).
• Gauss-Seidel: P = D + L, the diagonal plus the strictly lower triangular part of A, and N = −U, minus the strictly upper triangular part.
Convergence of iterative methods
• The iterative methods we will study are guaranteed to converge if the underlying matrix is diagonally dominant.
• Let's denote the iteration matrix by M = P^(-1) N and the error in the k-th step by e^(k) = x^(k) − x. Then, the iteration on the error is:
e^(k+1) = M e^(k), so e^(k) = M^k e^(0).
This is convergent only if the spectral radius of M is < 1.
In Jacobi, M = D^(-1) (D − A) = I − D^(-1) A. Therefore convergence is guaranteed for diagonally dominant matrices! Example of diagonally dominant matrices – our FD matrices; they are usually sparse and diagonally dominant!
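As a small illustrative check (not from the slides): for A = [4 1; 1 3], which is strictly diagonally dominant, D = diag(4, 3) and M = I − D^(-1) A = [0 −1/4; −1/3 0]. Its eigenvalues satisfy λ² = 1/12, so the spectral radius is about 0.29 < 1 and Jacobi converges, shrinking the error asymptotically by a factor of about 0.29 per iteration.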
Let’s see a naïve parallelization.
Domain decomposed Jacobi
• Jacobi on sparse blocks.
• PDE on blocks, then block Jacobi on each block! Nested parallelization, ideal for hybrid MPI-OpenMP codes.
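A naïve OpenMP parallelization of a single Jacobi sweep for a dense system A x = b might look like the sketch below (names and the dense storage are illustrative; a real domain-decomposed solver would apply this block by block):

      SUBROUTINE JACOBI_SWEEP(N, A, B, XOLD, XNEW)
      IMPLICIT NONE
      INTEGER, INTENT(IN)       :: N
      REAL(KIND=8), INTENT(IN)  :: A(N,N), B(N), XOLD(N)
      REAL(KIND=8), INTENT(OUT) :: XNEW(N)
      INTEGER :: I, J
      REAL(KIND=8) :: S

!     Each row update reads only XOLD and writes only XNEW(I),
!     so the rows are independent and the loop parallelizes safely.
!$OMP PARALLEL DO PRIVATE(I, J, S)
      DO I = 1, N
         S = 0.0D0
         DO J = 1, N
            IF (J /= I) S = S + A(I, J) * XOLD(J)
         END DO
         XNEW(I) = (B(I) - S) / A(I, I)
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE JACOBI_SWEEP

The caller swaps XOLD and XNEW between sweeps and stops when the change falls below a tolerance; in the hybrid picture above, each MPI rank would run such OpenMP-parallel sweeps on its own block.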
