Overview of Parallel Computing
Shawn T. Brown
Senior Scientific Specialist Pittsburgh Supercomputing Center
Overview:
Why parallel computing?
Parallel computing architectures
Parallel programming languages
Scalability of parallel programs
Why parallel computing?
At some point, building a more powerful computer with a single set of components requires too much effort.
Scientific applications need more!
Parallelism is the way to get more power!
Building more power from smaller units.
Combine memory, computing, and disk to make a computer that is greater than the sum of its parts.
The most obvious and useful way to do this is to build bigger computers from collections of smaller ones.
But this is not the only way to exploit parallelism.
Parallelism inside the CPU
On single chips
SSE SIMD instructions
Allow one CPU to execute the same instruction on multiple pieces of data at once.
The Opteron has 2-way SSE.
Peak performance is 2X the clock rate because it can theoretically perform two operations per cycle
There are chips that already have 4-way SSE, with more coming.
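As a rough illustration (not from the original slides), the C function below uses SSE2 intrinsics to add two arrays of doubles two elements per instruction; the function name and the assumption that n is even are ours.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two double arrays two elements at a time using 128-bit SSE registers
   (2-way SIMD for double precision). Assumes n is even. */
void vec_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);  /* load 2 doubles from a */
        __m128d vb = _mm_loadu_pd(&b[i]);  /* load 2 doubles from b */
        __m128d vc = _mm_add_pd(va, vb);   /* one instruction, 2 additions */
        _mm_storeu_pd(&c[i], vc);          /* store 2 results */
    }
}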
Parallelism in a desktop
This presentation is being given on a parallel computer!
Multi-core chips
Cramming multiple cores into a socket allows vendors to offer more computational performance at lower cost, both in money and in power.
Quad-core chips just starting to come out
AMD - Barcelona; Intel - Penryn
Intel recently announced an 80-core tiled research architecture, and a new MIT startup is building 60-core tiled architectures. Likely to proceed to 16-32 cores per socket in the next 10 years.
Parallel disks
There are many applications for which several TB of data need to be stored for analysis. Build a large filesystem from a collection of smaller hard drives.
Parallel Supercomputers
Building larger computers from smaller ones
Connected together by some sort of fast network
InfiniBand, Myrinet, SeaStar, etc.
Wide variety of architectures
From the small laboratory cluster to biggest supercomputers in the world, parallel computing is the way to get more power!
Shared-Memory Processing
Each processor can access the entire data space
Pros
Easier to program
Amenable to automatic parallelism
Can be used to run large-memory serial programs
Cons
Expensive
Difficult to implement at the hardware level
Limited number of processors (currently around 512)
Shared-Memory Processing
Programming
OpenMP, Pthreads, Shmem
Examples
Multiprocessor Desktops
Xeon and Opteron multi-core processors
SGI Altix
Intel Itanium 2 dual-core processors linked by the so-called NUMAflex interconnect
Up to 512 processors (1024 cores) sharing up to 128 TB of memory
Columbia (NASA)
Twenty 512-processor Altix computers, a combined total of 10,240 processors
Distributed Memory Machines
Each node in the computer has a locally addressable memory space
The nodes are connected together via some high-speed network: InfiniBand, Myrinet, Giganet, etc.
Pros
Really large machines
Cheaper to build and run
Cons
Harder to program
More difficult to manage
Memory management
Capacity vs. Capability
Capacity computing
Creating large supercomputers to facilitate high throughput of small parallel jobs
Cheaper, slower interconnects
Clusters running Linux, OS X, or Windows
Easy to build
Capability computing
Creating large supercomputers to enable computation at large scale
Running the entire machine to perform one task
A good, fast interconnect and balanced performance are important
Usually specialized hardware and operating systems
Networks
The performance of a distributed memory architecture is highly dependent on the speed and quality of the interconnect.
Latency
The time to send a 0 byte packet of data on the network
Bandwidth
The rate at which a very large packet of information can be sent
Topology
The configuration of the network that determines how many processing units are directly connected.
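A simple model that combines the first two quantities (a standard rule of thumb, added here for reference) estimates the time to send an n-byte message as

    T(n) \approx \alpha + n / \beta

where \alpha is the latency and \beta the bandwidth: short messages are dominated by latency, long messages by bandwidth.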
Networks
Commonly overlooked but important things
How much outstanding data can be on the network at a given time.
Highly scalable codes use asynchronous communication schemes, which require a large amount of data to be on the network at a given time.
Balance
If either the network or the compute nodes performs far out of proportion to the other, it makes for an unbalanced situation.
Hardware level support
Some routers can support things like network memory and hardware-level operations, which can greatly increase performance.
Networks
InfiniBand, Myrinet, GigE
Networks designed to run on smaller numbers of processors
SeaStar (Cray), Federation (IBM), Constellation (Sun)
Networks designed to scale to tens of thousands of processors
Clusters
Thunderbird (Sandia National Labs)
Dell PowerEdge series capacity cluster
4096 dual 3.6 GHz Intel Xeon processors
6 GB DDR-2 RAM per node
4x InfiniBand interconnect
System X (Virginia Tech)
1100 dual 2.3 GHz PowerPC 970FX processors
4 GB ECC DDR400 (PC3200) RAM
80 GB S-ATA hard disk drive
One Mellanox Cougar InfiniBand 4x HCA
Running Mac OS X
MPP (Massively Parallel Processing)
Red Storm (Sandia National Labs)
12,960 dual-core 2.4 GHz Opteron processors
4 GB of RAM per processor
Proprietary SeaStar interconnect provides machine-wide scalability
IBM BlueGene/L (LLNL)
131,072 700 MHz processors
256 MB of RAM per processor
Compute speed balanced with the interconnect
There is a catch
Harnessing this increased power requires advanced software development
That is why you are here and interested in parallel computers. Whether it be the PS3 or the Cray XT3, writing highly scalable parallel code is a requirement.
With multi-core chips and ever bigger distributed machines, it is only going to get more difficult for beginning programmers to write highly scalable software. Hackers need not apply!
Parallel Programming Models
Shared Memory
Multiple processors sharing the same memory space
Message Passing
Users make calls that explicitly share information between execution entities
Remote Memory Access
Processors can directly access memory on another processor
These models are then used to build more sophisticated models:
Loop-driven data parallel
Function-driven parallel (task-level)
Shared Memory Programming
SysV memory manipulation
One can explicitly create and manipulate shared memory spaces.
Pthreads (Posix Threads)
Lower level Unix library to build multi-threaded programs
OpenMP (www.openmp.org)
A specification designed to provide automatic parallelization through compiler pragmas.
Mainly loop-driven parallelism
Best suited to desktops and small SMP computers
Caution: Race Conditions
A race condition occurs when two threads change the same memory location at the same time; the result then depends on which thread wins.
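As a minimal sketch of loop-driven OpenMP parallelism (illustrative, not from the slides), the loop below sums an array; the reduction clause is what prevents the race condition that would occur if every thread updated sum directly.

/* Sum an array in parallel. Without the reduction clause, every thread
   would update "sum" at the same time: a race condition. */
double parallel_sum(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];          /* each thread keeps a private partial sum */
    }
    return sum;               /* partial sums are combined at the end */
}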
Distributed Memory Programming
No matter what the model, data must be passed from one memory space to another.
Synchronous vs. asynchronous communication
Whether computation and communication are mutually exclusive
One-sided vs Two-sided
Whether one or both processes are actively involved in the communication.
(Diagram: a two-sided message consists of a message ID plus the data payload; a one-sided put message consists of a destination address plus the data payload. The figure shows where each is handled among the network interface, the host CPU, and memory.)
Asynchronous and one-sided communication are the most scalable approaches.
MPI
Message Passing Interface: a message-passing library specification
Extended message-passing model
Not a language or compiler specification
Not a specific implementation or product
MPI is a standard
A list of rules and specifications; how it is implemented is left up to individual implementations.
Virtually all parallel machines in the world support an implementation of MPI.
Many more sophisticated parallel programming languages are written on top of MPI.
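As a minimal sketch of what two-sided message passing with MPI looks like in C (illustrative only, not part of the original slides), rank 0 sends an integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* sender */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Both sides must make a matching call (MPI_Send and MPI_Recv); that matching cost is what the one-sided models discussed below avoid.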
MPI Implementations
Because MPI is a standard, there are several implementations
MPICH - http://www-unix.mcs.anl.gov/mpi/mpich1/
Freely available, portable implementation
Available on everything
OpenMPI - http://www.open-mpi.org/
Includes the once popular LAM-MPI
Vendor specific implementations
CRAY, SGI, IBM
Remote Memory Access
Implemented as puts and gets into and out of remote memory locations
Sophisticated under the hood memory management.
MPI-2
Supports one-sided puts and gets to remote memory, as well as parallel I/O and dynamic process management (see the sketch below)
Shmem
Efficient implementation of globally shared pointers and one-sided data management
Inherent support for atomic memory operations
Also supports collectives, generally with less overhead than MPI
ARMCI (Aggregate Remote Memory Copy Interface)
A remote memory access interface that is highly portable, supporting many of the features of Shmem, with some optimization features
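A minimal sketch of the MPI-2 one-sided style mentioned above (illustrative only): every rank exposes a buffer as a window, and rank 0 puts a value directly into rank 1's memory with no matching receive.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, value = 99;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one int on every rank as a remotely accessible window. */
    MPI_Win_create(&buf, (MPI_Aint)sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0) {
        /* Deposit "value" into rank 1's buf; rank 1 makes no matching call. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* complete the put */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}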
Partitioned Global Address Space
Global address space
Global address space: any thread/process may directly read/write data allocated by another
Partitioned: data is designated as local or global
(Diagram: threads p0, p1, ..., pn each have private variables x and y, plus local (l) and global (g) pointers into the partitioned global address space.)
By default: Object heaps are shared Program stacks are private
SPMD languages: UPC, CAF, and Titanium
All three use an SPMD execution model. The emphasis in this talk is on UPC and Titanium (which is based on Java).
Dynamic languages: X10, Fortress, Chapel and Charm++
Slide reproduced with permission from Kathy Yelick (UC Berkeley)
Other Powerful languages
Charm++
Object-oriented parallel extension to C++
Run-time engine allows work to be scheduled on the computer
Highly dynamic, with extreme load-balancing capabilities
Completely asynchronous
NAMD, a very popular MD simulation engine, is written in Charm++
Other Powerful Languages
Portals
Completely one-sided communication scheme
Zero-copy, OS and application bypass
Designed to have MPI and other languages on top
Intended from the ground up to scale to 10,000s of processors
Now that you have written a parallel code, how good is it?
Parallel performance is defined in terms of scalability
Strong Scalability
Can we get faster for a fixed problem size?
(Figure: scaling of LeanCP (32 water molecules at 70 Ry) on BigBen (Cray XT3), real vs. ideal speedup out to roughly 2500 processors.)
Now that you have written a parallel code, how good is it?
Parallel performance is defined in terms of scalability
Weak Scalability
How big of a problem can we do?
Memory Scaling
We just looked at Performance Scaling
The speedup in execution time.
When one programs for a distributed architecture, the memory per node is terribly important.
Replicated Memory (BAD)
Identical data that must be stored on every processor
Distributed Memory (GOOD)
Data structures that have been broken down and stored across nodes.
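As a small sketch of this distributed approach (the function name is hypothetical, and MPI is assumed to be initialized by the caller), each rank allocates only its share of a global array of n elements rather than replicating all n elements everywhere:

#include <mpi.h>
#include <stdlib.h>

/* Allocate only this rank's share of a global array of n elements,
   instead of replicating all n elements on every rank. */
double *alloc_local_block(long n, MPI_Comm comm, long *local_n)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    long base  = n / nprocs;          /* elements every rank gets        */
    long extra = n % nprocs;          /* first "extra" ranks get one more */
    *local_n = base + (rank < extra ? 1 : 0);

    return malloc((size_t)(*local_n) * sizeof(double));
}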
Improving Scalability
Serial portions of your code limit the scalability
Amdahl's Law
If there is an x% serial component, the speedup cannot be better than 100/x.
Variants
If you decompose a problem into many parts, the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
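In formula form (a standard statement of the law, added here for reference), with serial fraction s = x/100 and p processors:

    S(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s}

For example, a code with a 5% serial component (s = 0.05) can never run more than 20 times faster, no matter how many processors are used.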
Problem decomposition
A parallel algorithm can only be as fast as its slowest chunk. It is very important to recognize how the algorithm can be broken apart.
Some decompositions are inherent to the algorithm; other decisions must be made based on performance.
Communication
Transmitting data between processors takes time. Asynchronous vs. Synchronous
Whether computation can be done while data is on its way to destinations.
Barriers and Synchronization
These say stop and enforce sequential execution in portions of code.
Global vs. Nearest Neighbor Communications
Global communication involves communication with large sets of processors Nearest Neighbor is point to point communication between processors close to each other
Scalable communication
Asynchronous,
Overlap communication and computation to hide the communication time.
nearest-neighbor,
Asymptotically linear as the number of processors grows.
with no barriers,
Stopping computation bad!
the most scalable way to write code.
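A minimal sketch of this asynchronous, nearest-neighbor style in MPI (illustrative only; compute_interior is a hypothetical routine for work that does not need the incoming data):

#include <mpi.h>

void compute_interior(void);   /* hypothetical: work independent of the halo */

/* Exchange halo values with a nearest neighbor while doing interior work. */
void halo_exchange(double *send_buf, double *recv_buf, int count,
                   int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the communication first... */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ...then do work that does not depend on recv_buf while data moves. */
    compute_interior();

    /* Only block when the halo data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}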
Communication Strategies for 3D FFT
Three approaches:
Chunk (= all rows with the same destination):
Wait for the 2nd-dimension FFTs to finish
Minimize the number of messages
Slab (= all rows in a single plane with the same destination):
Wait for the chunk of rows destined for one processor to finish
Overlap with computation
Pencil (= one row):
Send each row as it completes
Maximize overlap and match the natural layout
Joint work with Chris Bell, Rajesh Nishtala, and Dan Bonachea. Reproduced with permission from Kathy Yelick (UC Berkeley)
NAS FT Variants Performance Summary
(Chart: best MFlop rates per thread for all NAS FT benchmark versions (Chunk (NAS FT with FFTW), best NAS Fortran/MPI, best MPI (always slabs), and best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256/512, and Elan4 256/512 processors; peak around 0.5 Tflops.)
Slab is always best for MPI; the small-message cost is too high.
Pencil is always best for UPC; more overlap.
Reproduced with permission from Kathy Yelick (UC Berkeley)
These ideas make a difference.
Molecular dynamics uses a large 3D FFT to perform the PME procedure. For very large systems, on very large processor counts, pencil decomposition is better. For the Ribosome molecule (2.7 million atoms) on 4096 processors, the pencil decomposition is 30% faster than the slab.
Load imbalance
Your parallel algorithm can only go as fast as its slowest parallel work. Load imbalance occurs when one parallel component has more work to do than the others.
Load Balancing
There are strategies to mitigate load imbalance. Let's look at a loop:
for (i = 0; i < N; i++) {
    /* do work that scales with N */
}
There are a couple ways we could divide up the work
Statically
Just divide up the work evenly between processors
Bag of Tasks
The so-called bag of tasks is a way to divide the work dynamically.
Also called Master/Worker or Server/Client models.
Essentially...
One process acts as the server.
It divides up the initial work amongst the rest of the processes (workers).
When a worker is done with its assigned work, it sends back its processed result.
If there is more work to do, the server sends it out.
Continue until no work is left.
(Diagram: one master process distributing tasks to worker processes.)
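A compact sketch of this master/worker pattern in MPI (illustrative only; do_work and the integer task encoding are hypothetical, and result handling is omitted):

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

double do_work(int task);   /* hypothetical: process one task */

/* Bag of tasks: rank 0 hands out task numbers; workers return results
   and ask for more. Assumes ntasks >= number of workers. */
void bag_of_tasks(int ntasks)
{
    int rank, nprocs, task, next = 0;
    double result;
    MPI_Status st;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                           /* master */
        int active = nprocs - 1;
        for (int w = 1; w < nprocs; w++) {     /* seed one task per worker */
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);     /* whoever finishes first */
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                   /* worker */
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
}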
Back to example
The previous model is an example of dynamic load balancing.
Providing some means to adapt the work distribution to the problem at hand.
Example over 4 processors
Increasing scalability
Minimize serial sections of code
Beat Amdahl's law
Minimize communication overhead
Overlap computation and communication with asynchronous communication models
Choose algorithms that emphasize nearest-neighbor communication
Choose the right language for the job!
Dynamic load balancing
Some other tricks of the trade
Plan out your code beforehand.
Transforming a serial code to parallel is rarely the best strategy.
Minimize I/O and learn how to use parallel I/O
Very expensive time-wise, so use sparingly
Do not (and I repeat) do not use scratch files!
Parallel performance is mostly a series of trade-offs
Rarely is there one way to do the right thing.