Brief Overview of Parallel Computing
A Little Motivation
M. D. Jones, Ph.D.
Spring 2014
Background
Big Picture

Decomposition

[Figure: uniform volume example - decomposition in 1D, 2D, and 3D across
multiple CPUs. Note that load balancing in this case is simpler if the
work load per cell is the same.]
Decomposition (continued)

[Figure: nonuniform example - decomposition in 3D for a 16-way parallel
adaptive finite element calculation (in this case using a terrain
elevation). Load balancing in this case is much more complex.]

More Speed!

Note that, in the modern era, HPC inevitably means parallel (or
concurrent) computation. The driving forces behind this are pretty
simple - the desire is:

Solve my problem faster, i.e. I want the answer now (and who
doesn't want that?)

Solve a bigger problem than I (or anyone else, for that matter)
have ever before been able to tackle, and do so in a reasonable
time (generally reasonable = within a graduate student's time to
graduate!)
A Concrete Example
Example

Using classical gravitation, we have a very simple (but long-ranged)
force/potential. For each of N bodies, the resulting force is computed
from the other N - 1 bodies, thereby requiring N^2 force calculations
per step. A galaxy consists of approximately 10^12 such bodies, and
even the best algorithm for computing the forces requires N log2(N)
calculations, that is ≈ 10^12 ln(10^12)/ln(2) ≈ 4 x 10^13 calculations
per step. If each calculation takes ≈ 1 μsec, that is 40 x 10^6 seconds
per step. That is about 1.3 CPU-years per step. Ouch!
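To make the arithmetic concrete, a minimal C sketch (not from the
original slides) that reproduces the estimate, assuming the ≈ 1 μsec
per force calculation quoted above:

/* Back-of-the-envelope check of the N-body cost estimate above.
   The 10^12 bodies and 1 microsecond per calculation are the slide's
   assumptions, not measured values. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double N      = 1.0e12;              /* bodies in the galaxy            */
    double calcs  = N * log2(N);         /* N log2(N) force calculations    */
    double t_calc = 1.0e-6;              /* ~1 microsecond per calculation  */
    double secs   = calcs * t_calc;      /* wall time for one step          */
    double years  = secs / (365.25 * 24.0 * 3600.0);

    printf("calculations per step: %.3g\n", calcs);
    printf("seconds per step:      %.3g\n", secs);
    printf("CPU-years per step:    %.2f\n", years);
    return 0;
}

(Compile with the math library, e.g. cc nbody_cost.c -lm.)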
A Size Example

Example

In this problem, we want to increase the resolution to capture the
essential underlying behavior of the physical process being modeled.
So we determine that we need a matrix of order, say, 400000 (i.e.,
400000 x 400000 elements). Simply to store this matrix, in 64-bit
representation, requires ≈ 1.28 x 10^12 Bytes of memory, or roughly
1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes,
each having 4 GBytes of memory, by distributing the matrix across the
individual memories of each cluster node.
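A corresponding sketch (again not from the slides) of the storage
arithmetic - total bytes for an n x n double-precision matrix, and the
share per node when it is distributed evenly over the cluster:

/* Minimal sketch: storage for an n x n double-precision matrix and
   the per-node share when block-distributed over a cluster.  The
   matrix order and node count are the slide's assumed values. */
#include <stdio.h>

int main(void)
{
    long long n     = 400000;                      /* matrix order        */
    long long nodes = 1000;                        /* cluster nodes       */
    double bytes    = (double)n * (double)n * 8.0; /* 8 bytes per element */
    double gbytes   = bytes / (1024.0 * 1024.0 * 1024.0);

    printf("total storage: %.3g Bytes (about %.0f GBytes)\n", bytes, gbytes);
    printf("per node (of %lld): %.2f GBytes\n", nodes, gbytes / nodes);
    return 0;
}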
So ...

Scaling Concepts

Inherent Limitations

Parallel Efficiency
Let t_S denote the execution time on a single processor, and let f be
the fraction of the work that is inherently serial. If the remaining
fraction (1 - f) is spread evenly over p processors, the execution
time on p processors is

  t_p(p) = f t_S + (1 - f) t_S / p,

the parallel speedup is

  S(p) = t_S / t_p(p),

and the parallel efficiency is

  E(p) = S(p) / p = t_S / (p t_p(p)).
Amdahl's Law

Substituting t_p(p) into the definition of the speedup gives Amdahl's Law:

  S(p) = t_S / (f t_S + (1 - f) t_S / p) = p / (1 + f (p - 1)),

so that

  lim_{p → ∞} S(p) = 1 / f.

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference
Proceedings 30 (AFIPS Press, Atlantic City, NJ), 483-485, 1967.
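As a quick illustration (a minimal sketch, not from the original
slides), the following C program tabulates the Amdahl speedup
S(p) = p / (1 + f (p - 1)) and efficiency E(p) = S(p)/p for the same
serial fractions f plotted in the figure below:

/* Minimal sketch: evaluate Amdahl's Law S(p) = p / (1 + f*(p-1)) and
   the corresponding efficiency E(p) = S(p)/p for several serial
   fractions f.  The f values mirror those in the figure. */
#include <stdio.h>

int main(void)
{
    const double f[] = {0.001, 0.01, 0.02, 0.05, 0.1};
    const int    nf  = sizeof(f) / sizeof(f[0]);

    for (int p = 1; p <= 256; p *= 2) {
        printf("p = %3d :", p);
        for (int j = 0; j < nf; j++) {
            double S = p / (1.0 + f[j] * (p - 1));
            printf("  S=%7.2f (E=%.2f)", S, S / p);
        }
        printf("\n");
    }
    return 0;
}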
[Figure: Amdahl's Law - maximum speedup MAX(S(p)) versus number of
processors p (1 to 256), for serial fractions f = 0.001, 0.01, 0.02,
0.05, and 0.1.]

A Practical Example

Suppose the computational work divides evenly among the processors,

  t_comp(p) = t_comp(1) / p,

but each step also incurs a communication overhead t_comm. The speedup
is then

  S(p) = t_comp(1) / (t_comm + t_comp(p)) = p / (1 + t_comm / t_comp(p)).
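A small numeric check (a sketch under the assumptions above, with
purely illustrative values for t_comm and t_comp(1)) shows how the
communication-to-computation ratio caps the achievable speedup:

/* Minimal sketch: speedup with a fixed per-step communication cost,
   S(p) = p / (1 + t_comm/t_comp(p)), where t_comp(p) = t_comp(1)/p.
   The times below are illustrative, not measured values. */
#include <stdio.h>

int main(void)
{
    double t_comp1 = 100.0; /* total compute time per step (arbitrary units) */
    double t_comm  = 0.5;   /* fixed communication time per step (assumed)   */

    for (int p = 1; p <= 256; p *= 2) {
        double t_comp_p = t_comp1 / p;
        double S = p / (1.0 + t_comm / t_comp_p);
        printf("p = %3d   S(p) = %6.2f\n", p, S);
    }
    return 0;
}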
Gustafson's Law

We assumed that the problem size was fixed, which is (very) often not
the case. Assume instead that the problem size is scaled so that the
parallel run time t_p is held constant.

Now let t_p be held constant as p is increased. If f is the serial
fraction of the (scaled) parallel run, the equivalent serial time is
f t_p + (1 - f) p t_p, and the scaled speedup is

  S_s(p) = t_S / t_p,
         = (f t_p + (1 - f) p t_p) / t_p,
         = f + (1 - f) p,
         = p + (1 - p) f,

which grows linearly in p rather than saturating at 1/f.
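For contrast with Amdahl's fixed-size result, a minimal sketch (not
from the original slides) that evaluates both laws for the same serial
fraction f:

/* Minimal sketch: Gustafson's scaled speedup S_s(p) = f + (1 - f)*p,
   compared with Amdahl's fixed-size speedup for the same f. */
#include <stdio.h>

int main(void)
{
    double f = 0.05;   /* illustrative serial fraction */

    printf("%6s %12s %12s\n", "p", "Amdahl", "Gustafson");
    for (int p = 1; p <= 256; p *= 2) {
        double amdahl    = p / (1.0 + f * (p - 1));
        double gustafson = f + (1.0 - f) * p;
        printf("%6d %12.2f %12.2f\n", p, amdahl, gustafson);
    }
    return 0;
}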
Basic Terminology

Scalability

Definition (Scalable): An algorithm is scalable if there is a minimal
nonzero efficiency as p → ∞ and the problem size is allowed to take on
any value.

There are many parallel programming models, but roughly they break
down into the following categories:

Threaded models: e.g., OpenMP or Posix threads as the
application programming interface (API) - see the OpenMP sketch
following this list

Message passing models: e.g., MPI (the Message Passing Interface)

Partitioned global address space (PGAS) models: e.g., UPC and
Co-Array Fortran
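A minimal OpenMP sketch (illustrative, not from the original slides)
of the threaded model - the compiler splits the loop across threads
when built with OpenMP support (e.g. gfortran/gcc -fopenmp, ifort
-openmp):

/* Minimal OpenMP sketch: each thread accumulates a private partial
   sum over its share of the loop, and OpenMP combines the results. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int N = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1.0);
    }

    printf("harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}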
Thread Models

[Figure: threaded execution model - a single a.out process with its
data spawns threads 1 through N, which execute concurrently over time
while sharing the process's data.]
[Figure: message passing execution model - separate copies of a.out
(task 0 through task N-1), each with its own PID and private data,
run concurrently over time and exchange information via message
send/recv operations.]
There are two basic ways to think about decomposing a problem for
parallel execution:

Tasks: How do I reduce this problem into a set of tasks that can
be executed concurrently?

Data: How do I take the key data and represent it in such a way
that large chunks can be operated on independently (and
thus concurrently)? (A block-decomposition sketch follows.)
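A minimal sketch (not from the original slides) of the data
decomposition idea: each of P workers determines the contiguous block
of a length-N array that it owns, using only its rank:

/* Minimal sketch: block data decomposition.  Worker 'rank' (0..P-1)
   owns indices [lo, hi) of an array of length N; the first N % P
   workers get one extra element when N is not evenly divisible. */
#include <stdio.h>

static void block_range(long N, int P, int rank, long *lo, long *hi)
{
    long base = N / P;       /* minimum chunk size */
    long rem  = N % P;       /* leftover elements  */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    long N = 100;            /* total number of elements */
    int  P = 7;              /* number of workers        */

    for (int rank = 0; rank < P; rank++) {
        long lo, hi;
        block_range(N, P, rank, &lo, &hi);
        printf("rank %d owns [%3ld, %3ld)  (%ld elements)\n",
               rank, lo, hi, hi - lo);
    }
    return 0;
}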
Dependencies
Auto-Parallelization

Auto-Vectorization
OpenMP

OpenMP Availability
Version    Invocation (example)
3.1        ifort -openmp -openmp_report2 ...
3.0        pgf90 -mp ...
2.5        gfortran -fopenmp ...
3.0        gfortran -fopenmp ...
3.1        gfortran -fopenmp ...

MPI
Platform        Version (+MPI-2)
Linux IA64      1.2+(C++,MPI-I/O)
Linux x86_64    1.2+(C++,MPI-I/O), 2.x (various)
I am starting to favor the commercial Intel MPI for its ease of use,
especially in terms of supporting multiple networks/protocols.
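For reference, a minimal MPI sketch (not from the original slides)
showing the basic start-up, rank, and size calls that any of the
implementations above provide; run with, e.g., mpirun -np 4 ./a.out:

/* Minimal MPI example: each process reports its rank and the total
   number of processes.  Compile with an MPI wrapper, e.g. mpicc. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start up the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly         */
    return 0;
}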
Simple example of some UPC syntax:
shared int all_hits[THREADS];
...
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
   total_hits = 0;
   for (i = 0; i < THREADS; i++) {
      total_hits += all_hits[i];
   }
   pi = 4.0*((double)total_hits)/((double)trials);
   printf("PI estimated to %10.7f from %d trials on %d threads.\n",
          pi, trials, THREADS);
}
Co-Array Fortran
REAL, DIMENSION(N)[*] :: X, Y
X = Y[PE]             ! get from Y[PE]
Y[PE] = X             ! put into Y[PE]
Y[:] = X              ! broadcast X
Y[LIST] = X           ! broadcast X over subset of PEs in array LIST
Z(:) = Y[:]           ! collect all Y
S = MINVAL(Y[:])      ! min (reduce) all Y
B(1:M)[1:N] = S       ! S scalar, promoted to array of shape (1:M,1:N)
Libraries

BLAS

Vendor BLAS
Vendor    Library
AMD       ACML
Apple     Velocity Engine
Compaq    CXML
Cray      libsci
HP        MLIB
IBM       ESSL
Intel     MKL
NEC       PDLIB/SX
SGI       SCSL
SUN       Sun Performance Library
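To show what calling one of these libraries looks like in practice, a
minimal sketch (not from the original slides), assuming the portable
CBLAS interface (cblas.h) provided by, e.g., Intel MKL or the Netlib
reference implementation:

/* Minimal sketch: dot product of two vectors via the CBLAS interface
   (cblas_ddot).  Link against the BLAS of your choice, e.g. -lblas
   or (for MKL) -lmkl_rt. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};

    /* N = 4 elements, unit stride in both vectors */
    double dot = cblas_ddot(4, x, 1, y, 1);

    printf("x . y = %f\n", dot);   /* expect 20.0 */
    return 0;
}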
Performance Example

[Figure: DDOT performance on a U2 compute node - MFlop/s versus vector
length (in 8-byte words) for Intel cMKL 8.1.1 and the reference BLAS
(-lblas), with the L1 cache, L2 cache, and main-memory regimes marked.]

LAPACK

Built on top of the BLAS, LAPACK provides routines for (among other
things) least-squares solutions and eigenvalue problems.

www.netlib.org/lapack
ScaLAPACK
Programming Costs
Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and in the parallel algorithms
will limit speedups - all codes eventually reach a point of
diminishing returns
Examples

[Figure: parallel speedup versus number of processors (1 to 256,
ppn=2) for the JAC (Joint Amber-CHARMM) benchmark - DHFR protein, 7182
residues, 23558 atoms, 7023 ... - comparing the ch_p4 and ch_gm
devices, with an inset table of measured MD seconds per step.]

Communication Costs

Communications
Communication Considerations
Communication needs between parallel processes affect parallel
programs in several important ways:
Cost: there is always overhead when communicating:
U2 Cluster Interconnects

[Figure: U2 cluster interconnect comparison - ch_mx, ch_gm, ch_p4, and
DAPL-QDR-IB - versus message length in Bytes; one panel shows
bandwidth in MByte/s.]
[Figure: MPI alltoall time for a 4 KB buffer, t_alltoall(4KB) in μsec,
versus number of MPI processes (10 to 1000), using Intel MPI on
12-core Xeon E5645 nodes with Qlogic QDR IB, for ppn = 1, 2, 4, 6, and
12.]