
Parallel Computing: Overview

John Urbanic
urbanic@psc.edu
Introduction to Parallel Computing
• Why we need parallel computing
• How such machines are built
• How we actually use these machines
New Applications / Clock Speeds / CPU Clock (chart slides; figures not reproduced in this extract)
Clock Speeds
When the PSC went from a 2.7 GFlop Y-MP to a 16
GFlop C90, the clock only got 50% faster. The
rest of the speed increase was due to increased use
of parallel techniques:
• More processors (8 → 16)
• Longer vector pipes (64 → 128)
• Parallel functional units (2)
• Cray X1 (13 GFlops/CPU) is only 800 MHz!
Clock Speeds
So, we want as many processors working
together as possible. How do we do this?
There are two distinct elements:
Hardware
• vendor does this
Software
• you, at least today
Amdahl’s Law
How many processors can we really use?
Let’s say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel:
Amdahl’s Law
If we run this on a parallel machine with five processors, our code now takes about 60s. We have sped it up by about 40%.
Let’s say we use a thousand processors: we have now sped our code up by about a factor of two.
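These figures follow from Amdahl’s law (the formula itself is not shown on the slide; a 100-second baseline run is assumed here, consistent with the 60 s figure above). With parallel fraction f and p processors:

S(p) = 1 / ((1 - f) + f/p)

For f = 0.5: S(5) = 1 / (0.5 + 0.1) ≈ 1.67, i.e. roughly 60 s, and S(1000) = 1 / (0.5 + 0.0005) ≈ 2.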
Amdahl’s Law
This seems pretty depressing, and it does point out one limitation of converting old
codes one subroutine at a time. However, most new codes, and almost all parallel
algorithms, can be written almost entirely in parallel (usually, the “start up” or
initial input I/O code is the exception), resulting in significant practical speedups.
This can be quantified by how well a code scales, which is often measured as efficiency.
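Efficiency is not defined on the slide itself; the standard definition is speedup per processor:

E(p) = S(p) / p

For the half-parallel legacy code above, E(5) ≈ 0.33 and E(1000) ≈ 0.002, whereas a code that scales well keeps E close to 1 as p grows.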
Shared Memory
Easiest to program. There are no
real data distribution or
communication issues. Why
doesn’t everyone use this
scheme?
• Limited numbers of processors (tens) – Only so many C90 (60) / Marvel (64) / Altix (12) processors can share the same bus before conflicts dominate.
• Limited memory size – Memory shares the bus as well, so accessing one part of memory will interfere with access to other parts.
Distributed Memory
• Number of processors only limited by physical
size (tens of meters).
• Memory only limited by the number of processors
times the maximum memory per processor (very
large). However, physical packaging usually
dictates no local disk per node and hence no
virtual memory.
• Since local and remote data have much different
access times, data distribution is very important.
We must minimize communication.
Common Distributed Memory
Machines
• CM-2
• CM-5
• T3E
• Workstation Cluster
• SP4
• TCS
• ASCI (Red, Blue, Purple, White, Red Storm…)
• Earth Simulator
Common Distributed Memory
Machines
While the CM-2 is SIMD (one instruction unit for multiple processors),
all the new machines are MIMD (multiple instructions for multiple
processors) and based on commodity processors.
SP-4 – POWER4
CM-5 – SPARC
T3E – Alpha
Workstations – Mostly Intel and AMD
TCS – Alpha

Therefore, the single most defining characteristic of any of these machines is probably the network.
Latency and Bandwidth
Even with the "perfect" network we have here, performance is determined by two more quantities that,
together with the topologies we'll look at, pretty much define the network: latency and bandwidth.
Latency can nicely be defined as the time required to send a message with 0 bytes of data. This number
often reflects either the overhead of packing your data into packets, or the delays in making
intervening hops across the network between two nodes that aren't next to each other.
Bandwidth is the rate at which very large packets of information can be sent. If there were no latency, this would be the rate at which all data is transferred. It often reflects the physical capability of the wires and
electronics connecting nodes.
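A common first-order model (not stated on the slide, but implied by these definitions) for the time to send an n-byte message is:

T(n) ≈ latency + n / bandwidth

With illustrative numbers of 10 µs latency and 1 GB/s bandwidth, an 8-byte message costs essentially the full 10 µs (latency-bound), while an 8 MB message costs about 8 ms (bandwidth-bound).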
Token-Ring/Ethernet with Workstations
Complete Connectivity
Super Cluster / SP4
CM-2
Binary Tree
CM-5 Fat Tree
INTEL Paragon (2-D Mesh)
3-D Torus
T3E has Global Addressing hardware, and this helps to simulate shared memory.
Torus means that “ends” are connected. This means A is really connected to B and the cube has no real boundary.
TCS Fat Tree
Data Parallel
Only one executable.
Do computation on arrays of data using array operators.
Do communications using array shift or rearrangement operators.
Good for problems with static load balancing that are array-oriented SIMD machines.
Variants:
FORTRAN 90
CM FORTRAN
HPF
C*
CRAFT
Strengths:
1. Scales transparently to different size machines
2. Easy debugging, as there is only one copy of code executing in highly synchronized fashion
Weaknesses:
1. Much wasted synchronization
2. Difficult to balance load
Data Parallel – Cont’d
Computation in FORTRAN 90
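The code from the original slide is not reproduced in this extract; the following is a minimal Fortran 90 sketch of whole-array computation (array names and sizes are illustrative):

program array_compute
  implicit none
  real, dimension(1000) :: a, b, c
  b = 1.0                   ! whole-array assignment
  c = 2.0
  a = b + c                 ! one array statement; elements can be computed in parallel
  where (a > 2.5) a = 0.0   ! masked (conditional) array assignment
  print *, sum(a)           ! intrinsic reduction over the whole array
end program array_compute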
Data Parallel – Cont’d
Communication in FORTRAN 90
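Again, the original slide's code is not in this extract; below is a minimal sketch of communication via array shift operators. CSHIFT implies nearest-neighbor data movement when the array is distributed across processors (the values and the averaging step are illustrative):

program array_shift
  implicit none
  integer :: i
  real, dimension(8) :: t, left, right
  t = (/ (real(i), i = 1, 8) /)
  left  = cshift(t, -1)     ! each element receives its left neighbor's value (wrapping around)
  right = cshift(t,  1)     ! each element receives its right neighbor's value (wrapping around)
  t = 0.5 * (left + right)  ! e.g. a simple nearest-neighbor averaging step
  print *, t
end program array_shift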
Data Parallel – Cont’d
When to use Data Parallel
– Very array-oriented programs
• FEA
• Fluid Dynamics
• Neural Nets
• Weather Modeling
– Very synchronized operations
• Image processing
• Math analysis
Work Sharing
Splits up tasks (as opposed to arrays in data parallel) such as loops amongst separate processors.
Do computation on loops that are automatically distributed.
Do communication as a side effect of data loop distribution. Not important on shared memory machines.
If you have used CRAYs before, think of this as “advanced multitasking.”
Good for shared memory implementations.
Variants:
* CRAFT
* Multitasking
* OpenMP
Strengths:
1. Directive based, so it can be added to existing serial codes
Weaknesses:
1. Limited flexibility
2. Efficiency dependent upon structure of existing serial code
3. May be very poor with distributed memory.
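A minimal work-sharing sketch using OpenMP directives (one of the variants named above; the loop body is illustrative). The point is that the serial loop is unchanged and the directive asks the compiler to distribute its iterations:

program loop_share
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, dimension(n) :: x, y
  x = 1.0
  ! the directive below splits the loop iterations among the available processors
  !$omp parallel do
  do i = 1, n
     y(i) = 2.0 * x(i) + 1.0
  end do
  !$omp end parallel do
  print *, y(1), y(n)
end program loop_share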
Work Sharing – Cont’d
When to use Work Sharing
• Very large / complex / old existing codes:
Gaussian 90
• Already multitasked codes: Charmm
• Portability (Directive Based)
• (Not Recommended)
Load Balancing
An important consideration which can be controlled by communication is
load balancing:
Consider the case where a dataset is distributed evenly over 4 sites.
Each site will run a piece of code which uses the data as input and
attempts to find a convergence. It is possible that the data contained at
sites 0, 2, and 3 may converge much faster than the data at site 1. If
this is the case, the three sites which finished first will remain idle
while site 1 finishes. When attempting to balance the amount of work
being done at each site, one must take into account the speed of the
processing site, the communication "expense" of starting and
coordinating separate pieces of work, and the amount of work required
by various pieces of data.
There are two forms of load balancing: static and dynamic.
Load Balancing – Cont’d
Static Load Balancing
In static load balancing, the programmer must
make a decision and assign a fixed amount of
work to each processing site a priori.
Static load balancing can be used in either the
Master-Slave (Host-Node) programming model
or the "Hostless" programming model.
Load Balancing – Cont’d
Static Load Balancing yields good performance
when:
• homogeneous cluster
• each processing site has an equal amount of work
Poor performance when:
• heterogeneous cluster where some processors are
much faster (unless this is taken into account in
the program design)
• work distribution is uneven
Load Balancing – Cont’d
Dynamic Load Balancing
Dynamic load balancing can be further divided into the categories:
• task-oriented – when one processing site finishes its task, it is assigned another task (this is the most commonly used form).
• data-oriented – when one processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process (this is much more complicated because it requires an extensive amount of bookkeeping).
Dynamic load balancing can be used only in the Master-Slave programming model.
• ideal for:
• codes where tasks are large enough to keep each processing site busy
• codes where work is uneven
• heterogeneous clusters
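The slides do not include code for this; below is a minimal sketch of the task-oriented, Master-Slave scheme written with MPI (the use of MPI and the trivial task arithmetic are assumptions, not from the original). It assumes at least two processes and at least as many tasks as workers:

program task_farm
  use mpi
  implicit none
  integer, parameter :: ntasks = 100
  integer :: ierr, rank, nprocs, worker, sent, done
  integer :: task, result, stop_signal
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  stop_signal = 0

  if (rank == 0) then
     ! Master (Host): prime every worker with one task, then hand out the
     ! rest as results come back, so faster workers simply receive more tasks.
     sent = 0
     do worker = 1, nprocs - 1
        sent = sent + 1
        call MPI_Send(sent, 1, MPI_INTEGER, worker, 1, MPI_COMM_WORLD, ierr)
     end do
     do done = 1, ntasks
        call MPI_Recv(result, 1, MPI_INTEGER, MPI_ANY_SOURCE, 2, &
                      MPI_COMM_WORLD, status, ierr)
        worker = status(MPI_SOURCE)
        if (sent < ntasks) then
           sent = sent + 1
           call MPI_Send(sent, 1, MPI_INTEGER, worker, 1, MPI_COMM_WORLD, ierr)
        else
           ! no work left: tell this worker to stop (tag 0)
           call MPI_Send(stop_signal, 1, MPI_INTEGER, worker, 0, MPI_COMM_WORLD, ierr)
        end if
     end do
  else
     ! Worker (Node): process tasks until the stop tag arrives.
     do
        call MPI_Recv(task, 1, MPI_INTEGER, 0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
        if (status(MPI_TAG) == 0) exit
        result = task * task   ! stand-in for the real, possibly uneven, work
        call MPI_Send(result, 1, MPI_INTEGER, 0, 2, MPI_COMM_WORLD, ierr)
     end do
  end if

  call MPI_Finalize(ierr)
end program task_farm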
