Overview of Parallel Computing
Shawn T. Brown
Senior Scientific Specialist Pittsburgh Supercomputing Center
Overview:
Why parallel computing?
Parallel computing architectures
Parallel programming languages
Scalability of parallel programs
Why parallel computing?
At some point, building a more powerful computer with a single set of components requires too much effort.
Scientific applications need more!
Parallelism is the way to get more power!
Building more power from smaller units.
Combine memory, computing, and disk to make a computer that is greater than the sum of its parts.
The most obvious and useful way to do this is to build bigger computers from collections of smaller ones.
But this is not the only way to exploit parallelism.
Parallelism inside the CPU
On single chips
SSE SIMD instructions
Allow one CPU to execute the same instruction on multiple pieces of data at once.
The Opteron has 2-way SSE.
Peak performance is 2X the clock rate because it can theoretically perform two operations per cycle
There are chips that already have 4-way SSE, with more coming.
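As a rough illustration (not from the original slides), the C function below uses SSE2 intrinsics to add two arrays of doubles two elements per instruction; the function name and the assumption that n is even are ours.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two double arrays two elements at a time using 128-bit SSE registers
   (2-way SIMD for double precision). Assumes n is even. */
void vec_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);  /* load 2 doubles from a */
        __m128d vb = _mm_loadu_pd(&b[i]);  /* load 2 doubles from b */
        __m128d vc = _mm_add_pd(va, vb);   /* one instruction, 2 additions */
        _mm_storeu_pd(&c[i], vc);          /* store 2 results */
    }
}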
Parallelism in a desktop
This presentation is being given on a parallel computer!
Multi-core chips
Cramming multiple cores into a socket allows vendors to offer more computational performance at lower cost, both in money and in power.
Quad-core chips just starting to come out
AMD - Barcelona; Intel - Penryn
Intel recently announced an 80-core tiled research architecture, and a new MIT startup is building 60-core tiled architectures. Likely to proceed to 16-32 cores per socket in the next 10 years.
Parallel disks
There are many applications for which several TB of data need to be stored for analysis. Build a large filesystem from a collection of smaller hard drives.
Parallel Supercomputers
Building larger computers from smaller ones
Connected together by some sort of fast network
InfiniBand, Myrinet, SeaStar, etc.
Wide variety of architectures
From the small laboratory cluster to biggest supercomputers in the world, parallel computing is the way to get more power!
Shared-Memory Processing
Each processor can access the entire data space
Pros
Easier to program
Amenable to automatic parallelism
Can be used to run large-memory serial programs
Cons
Expensive
Difficult to implement at the hardware level
Limited number of processors (currently around 512)
Shared-Memory Processing
Programming
OpenMP, Pthreads, Shmem
Examples
Multiprocessor Desktops
Xeon and Opteron multi-core processors
SGI Altix
Intel Itanium 2 dual-core processors linked by the so-called NUMAflex interconnect
Up to 512 processors (1024 cores) sharing up to 128 TB of memory
Columbia (NASA)
Twenty 512-processor Altix computers, a combined total of 10,240 processors
Distributed Memory Machines
Each node in the computer has a locally addressable memory space
The nodes are connected together via some high-speed network: InfiniBand, Myrinet, Giganet, etc.
Pros
Really large machines
Cheaper to build and run
Cons
Harder to program
More difficult to manage
Memory management
Capacity vs. Capability
Capacity computing
Creating large supercomputers to facilitate high throughput of small parallel jobs
Cheaper, slower interconnects
Clusters running Linux, OS X, or Windows
Easy to build
Capability computing
Creating large supercomputers to enable computation at large scale
Running the entire machine to perform one task
A good, fast interconnect and balanced performance are important
Usually specialized hardware and operating systems
Networks
The performance of a distributed memory architecture is highly dependent on the speed and quality of the interconnect.
Latency
The time to send a 0 byte packet of data on the network
Bandwidth
The rate at which a very large packet of information can be sent
Topology
The configuration of the network that determines how many processing units are directly connected.
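A simple model that combines the first two quantities (a standard rule of thumb, added here for reference) estimates the time to send an n-byte message as

    T(n) \approx \alpha + n / \beta

where \alpha is the latency and \beta the bandwidth: short messages are dominated by latency, long messages by bandwidth.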
Networks
Commonly overlooked but important things
How much outstanding data can be on the network at a given time.
Highly scalable codes use asynchronous communication schemes, which require a large amount of data to be on the network at a given time.
Balance
If either the network or the compute nodes performs far out of proportion to the other, it makes for an unbalanced situation.
Hardware level support
Some routers can support things like network memory and hardware-level operations, which can greatly increase performance.
Networks
InfiniBand, Myrinet, GigE
Networks designed to run on smaller numbers of processors
SeaStar (Cray), Federation (IBM), Constellation (Sun)
Networks designed to scale to tens of thousands of processors
Clusters
Thunderbird (Sandia National Labs)
Dell PowerEdge series capacity cluster
4096 dual 3.6 GHz Intel Xeon processors
6 GB DDR-2 RAM per node
4x InfiniBand interconnect
System X (Virginia Tech)
1100 dual 2.3 GHz PowerPC 970FX processors
4 GB ECC DDR400 (PC3200) RAM
80 GB S-ATA hard disk drive
One Mellanox Cougar InfiniBand 4x HCA
Running Mac OS X
MPP (Massively Parallel Processing)
Red Storm (Sandia National Labs)
12,960 dual-core 2.4 GHz Opteron processors
4 GB of RAM per processor
Proprietary SeaStar interconnect provides machine-wide scalability
IBM BlueGene/L (LLNL)
131,072 700 MHz processors
256 MB of RAM per processor
Compute speed balanced with the interconnect
There is a catch
Harnessing this increased power requires advanced software development
That is why you are here and interested in parallel computers. Whether it be the PS3 or the Cray XT3, writing highly scalable parallel code is a requirement.
With multi-core chips and ever bigger distributed machines, it is only going to get more difficult for beginning programmers to write highly scalable software. Hackers need not apply!
Parallel Programming Models
Shared Memory
Multiple processors sharing the same memory space
Message Passing
Users make calls that explicitly share information between execution entities
Remote Memory Access
Processors can directly access memory on another processor
These models are then used to build more sophisticated models:
Loop-driven data parallel
Function-driven parallel (task-level)
Shared Memory Programming
SysV memory manipulation
One can explicitly create and manipulate shared memory spaces.
Pthreads (Posix Threads)
Lower level Unix library to build multi-threaded programs
OpenMP (www.openmp.org)
A specification designed to provide automatic parallelization through compiler pragmas.
Mainly loop-driven parallelism
Best suited to desktops and small SMP computers
Caution: Race Conditions
A race condition occurs when two threads change the same memory location at the same time; the result then depends on which thread wins.
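As a minimal sketch of loop-driven OpenMP parallelism (illustrative, not from the slides), the loop below sums an array; the reduction clause is what prevents the race condition that would occur if every thread updated sum directly.

/* Sum an array in parallel. Without the reduction clause, every thread
   would update "sum" at the same time: a race condition. */
double parallel_sum(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];          /* each thread keeps a private partial sum */
    }
    return sum;               /* partial sums are combined at the end */
}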
Distributed Memory Programming
No matter what the model, data must be passed from one memory space to another.
Synchronous vs. asynchronous communication
Whether computation and communication are mutually exclusive
One-sided vs Two-sided
Whether one or both processes are actively involved in the communication.
(Diagram: a two-sided message consists of a message ID plus the data payload; a one-sided put message consists of a destination address plus the data payload. The figure shows where each is handled among the network interface, the host CPU, and memory.)
Asynchronous and one-sided communication are the most scalable approaches.
MPI
Message Passing Interface: a message-passing library specification
Extended message-passing model
Not a language or compiler specification
Not a specific implementation or product
MPI is a standard
A list of rules and specifications; how it is implemented is left up to individual implementations.
Virtually all parallel machines in the world support an implementation of MPI.
Many more sophisticated parallel programming languages are written on top of MPI.
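As a minimal sketch of what two-sided message passing with MPI looks like in C (illustrative only, not part of the original slides), rank 0 sends an integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* sender */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Both sides must make a matching call (MPI_Send and MPI_Recv); that matching cost is what the one-sided models discussed below avoid.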
MPI Implementations
Because MPI is a standard, there are several implementations
MPICH - http://www-unix.mcs.anl.gov/mpi/mpich1/
Freely available, portable implementation
Available on everything
OpenMPI - http://www.open-mpi.org/
Includes the once popular LAM-MPI
Vendor specific implementations
CRAY, SGI, IBM
Remote Memory Access
Implemented as puts and gets into and out of remote memory locations
Sophisticated under the hood memory management.
MPI-2
Supports one-sided puts and gets to remote memory, as well as parallel I/O and dynamic process management (see the sketch below)
Shmem
Efficient implementation of globally shared pointers and one-sided data management
Inherent support for atomic memory operations
Also supports collectives, generally with less overhead than MPI
ARMCI (Aggregate Remote Memory Copy Interface)
A remote memory access interface that is highly portable, supporting many of the features of Shmem, with some optimization features
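A minimal sketch of the MPI-2 one-sided style mentioned above (illustrative only): every rank exposes a buffer as a window, and rank 0 puts a value directly into rank 1's memory with no matching receive.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, value = 99;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one int on every rank as a remotely accessible window. */
    MPI_Win_create(&buf, (MPI_Aint)sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0) {
        /* Deposit "value" into rank 1's buf; rank 1 makes no matching call. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* complete the put */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}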
Partitioned Global Address Space
Global address space
Global address space: any thread/process may directly read/write data allocated by another
Partitioned: data is designated as local or global
(Diagram: threads p0, p1, ..., pn each have private variables x and y, plus local (l) and global (g) pointers into the partitioned global address space.)
By default: Object heaps are shared Program stacks are private
SPMD languages: UPC, CAF, and Titanium
All three use an SPMD execution model. The emphasis in this talk is on UPC and Titanium (which is based on Java).
Dynamic languages: X10, Fortress, Chapel and Charm++
Slide reproduced with permission from Kathy Yelick (UC Berkeley)
Other Powerful languages
Charm++
Object-oriented parallel extension to C++
Run-time engine allows work to be scheduled on the computer
Highly dynamic, with extreme load-balancing capabilities
Completely asynchronous
NAMD, a very popular MD simulation engine, is written in Charm++
Other Powerful Languages
Portals
Completely one-sided communication scheme
Zero-copy, OS and application bypass
Designed to have MPI and other languages on top
Intended from the ground up to scale to 10,000s of processors
Now that you have written a parallel code, how good is it?
Parallel performance is defined in terms of scalability
Strong Scalability
Can we get faster for a fixed problem size?
(Figure: scaling of LeanCP (32 water molecules at 70 Ry) on BigBen (Cray XT3), real vs. ideal speedup out to roughly 2500 processors.)
Now that you have written a parallel code, how good is it?
Parallel performance is defined in terms of scalability
Weak Scalability
How big of a problem can we do?
Memory Scaling
We just looked at Performance Scaling
The speedup in execution time.
When one programs for a distributed architecture, the memory per node is terribly important.
Replicated Memory (BAD)
Identical data that must be stored on every processor
Distributed Memory (GOOD)
Data structures that have been broken down and stored across nodes.
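As a small sketch of this distributed approach (the function name is hypothetical, and MPI is assumed to be initialized by the caller), each rank allocates only its share of a global array of n elements rather than replicating all n elements everywhere:

#include <mpi.h>
#include <stdlib.h>

/* Allocate only this rank's share of a global array of n elements,
   instead of replicating all n elements on every rank. */
double *alloc_local_block(long n, MPI_Comm comm, long *local_n)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    long base  = n / nprocs;          /* elements every rank gets        */
    long extra = n % nprocs;          /* first "extra" ranks get one more */
    *local_n = base + (rank < extra ? 1 : 0);

    return malloc((size_t)(*local_n) * sizeof(double));
}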
Improving Scalability
Serial portions of your code limit the scalability
Amdahl's Law
If there is an x% serial component, the speedup cannot be better than 100/x.
Variants
If you decompose a problem into many parts, the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
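In formula form (a standard statement of the law, added here for reference), with serial fraction s = x/100 and p processors:

    S(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s}

For example, a code with a 5% serial component (s = 0.05) can never run more than 20 times faster, no matter how many processors are used.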
Problem decomposition
A parallel algorithm can only be as fast as its slowest chunk. It is very important to recognize how the algorithm can be broken apart.
Some decompositions are inherent to the algorithm; other decisions must be made based on performance.
Communication
Transmitting data between processors takes time. Asynchronous vs. Synchronous
Whether computation can be done while data is on its way to destinations.
Barriers and Synchronization
These say stop and enforce sequential execution in portions of code.
Global vs. Nearest Neighbor Communications
Global communication involves communication with large sets of processors Nearest Neighbor is point to point communication between processors close to each other
Scalable communication
Asynchronous,
Overlap communication and computation to hide the communication time.
nearest-neighbor,
Asymptotically linear as the number of processors grows.
with no barriers,
Stopping computation bad!
the most scalable way to write code.
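A minimal sketch of this asynchronous, nearest-neighbor style in MPI (illustrative only; compute_interior is a hypothetical routine for work that does not need the incoming data):

#include <mpi.h>

void compute_interior(void);   /* hypothetical: work independent of the halo */

/* Exchange halo values with a nearest neighbor while doing interior work. */
void halo_exchange(double *send_buf, double *recv_buf, int count,
                   int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the communication first... */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ...then do work that does not depend on recv_buf while data moves. */
    compute_interior();

    /* Only block when the halo data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}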
Communication Strategies for 3D FFT
Three approaches:
Chunk (= all rows with the same destination):
Wait for the 2nd-dimension FFTs to finish
Minimize the number of messages
Slab (= all rows in a single plane with the same destination):
Wait for the chunk of rows destined for one processor to finish
Overlap with computation
Pencil (= one row):
Send each row as it completes
Maximize overlap and match the natural layout
Joint work with Chris Bell, Rajesh Nishtala, and Dan Bonachea. Reproduced with permission from Kathy Yelick (UC Berkeley)
NAS FT Variants Performance Summary
(Chart: best MFlop rates per thread for all NAS FT benchmark versions (Chunk (NAS FT with FFTW), best NAS Fortran/MPI, best MPI (always slabs), and best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256/512, and Elan4 256/512 processors; peak around 0.5 Tflops.)
Slab is always best for MPI; the small-message cost is too high.
Pencil is always best for UPC; more overlap.
Reproduced with permission from Kathy Yelick (UC Berkeley)
These ideas make a difference.
Molecular dynamics uses a large 3D FFT to perform the PME procedure. For very large systems, on very large processor counts, pencil decomposition is better. For the Ribosome molecule (2.7 million atoms) on 4096 processors, the pencil decomposition is 30% faster than the slab.
Load imbalance
Your parallel algorithm can only go as fast as its slowest parallel work. Load imbalance occurs when one parallel component has more work to do than the others.
Load Balancing
There are strategies to mitigate load imbalance. Let's look at a loop:
for (i = 0; i < N; i++) {
    /* do work that scales with N */
}
There are a couple ways we could divide up the work
Statically
Just divide up the work evenly between processors
Bag of Tasks
The so-called bag of tasks is a way to divide the work dynamically.
Also called Master/Worker or Server/Client models.
Essentially...
One process acts as the server.
It divides up the initial work amongst the rest of the processes (workers).
When a worker is done with its assigned work, it sends back its processed result.
If there is more work to do, the server sends it out.
Continue until no work is left.
(Diagram: one master process distributing tasks to worker processes.)
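A compact sketch of this master/worker pattern in MPI (illustrative only; do_work and the integer task encoding are hypothetical, and result handling is omitted):

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

double do_work(int task);   /* hypothetical: process one task */

/* Bag of tasks: rank 0 hands out task numbers; workers return results
   and ask for more. Assumes ntasks >= number of workers. */
void bag_of_tasks(int ntasks)
{
    int rank, nprocs, task, next = 0;
    double result;
    MPI_Status st;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                           /* master */
        int active = nprocs - 1;
        for (int w = 1; w < nprocs; w++) {     /* seed one task per worker */
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);     /* whoever finishes first */
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                   /* worker */
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
}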
Back to example
The previous model is an example of dynamic load balancing.
Providing some means to adapt the work distribution to the problem at hand.
Example over 4 processors
Increasing scalability
Minimize serial sections of code
Beat Amdahl's law
Minimize communication overhead
Overlap computation and communication with asynchronous communication models
Choose algorithms that emphasize nearest-neighbor communication
Choose the right language for the job!
Dynamic load balancing
Some other tricks of the trade
Plan out your code beforehand.
Transforming a serial code to parallel is rarely the best strategy.
Minimize I/O and learn how to use parallel I/O
Very expensive time-wise, so use sparingly
Do not (and I repeat) do not use scratch files!
Parallel performance is mostly a series of trade-offs
Rarely is there one way to do the right thing.