Distributed Memory Machines

This document discusses distributed memory machines. It begins by describing their basic architecture, in which each processor is connected to its own local memory and communicates with other processors via a network interface and interconnect. It then discusses several key issues, including the design of the network interface, interconnect topology, routing algorithms, switching strategies, and deadlock avoidance. Several common interconnect topologies are described, such as meshes, tori, hypercubes, and butterflies, along with their properties. The document concludes with communication cost models (the linear model and LogGP) and an introduction to message passing with MPI.


Distributed Memory Machines
Arvind Krishnamurthy
Fall 2004

Distributed Memory Machines

Examples: Intel Paragon, Cray T3E, IBM SP
Each processor is connected to its own memory and cache
  it cannot directly access another processor's memory
Each node has a network interface (NI) for all communication and synchronization
Key issues: design of the NI and the interconnection topology
[Figure: nodes P1, P2, ..., Pn, each with a local memory and a network interface (NI), attached to a common interconnect]

Historical Perspective

Early machines were:
  collections of microprocessors
  bi-directional queues between neighbors
  messages were forwarded by processors along the path
  strong emphasis on topology in algorithms

Network Analogy

To have a large number of transfers occurring at once, you need a large number of distinct wires.
Networks are like streets:
  link = street
  switch = intersection
  distance (hops) = number of blocks traveled
  routing algorithm = travel plans
Important properties:
  latency: how long it takes to get somewhere in the network
  bandwidth: how much data can be moved per unit time
    limited by the number of wires and the rate at which each wire can accept data

Network Characteristics

Topology: how things are connected
  two types of nodes: hosts and switches
Routing algorithm: the paths used
  e.g., all east-west then all north-south in a mesh
Switching strategy: how data in a message traverses a route
  circuit switching vs. packet switching
Flow control: what happens if there is congestion
  if two or more messages attempt to use the same channel, they may stall, move to buffers, be rerouted, be discarded, etc.
[Figure: packet format, consisting of a routing and control header, the data payload, an error code, and a trailer]

Topology Properties

Question: what nice properties do we want the network topology to possess?
Routing distance: number of links on a route; minimize the average distance
Diameter: the maximum shortest path between two nodes
A network is partitioned if some nodes cannot reach others
The bandwidth of a link is w * 1/t, where w is the number of wires and t is the time per bit
  effective bandwidth is lower due to packet overhead
Bisection bandwidth: sum of the minimum number of channels which, if removed, will partition the network

Linear and Ring Topologies

Linear array
  diameter is n-1, average distance is ~n/3
  bisection bandwidth is 1
  used in algorithms with 1D arrays
Torus or ring
  diameter is n/2, average distance is ~n/4
  bisection bandwidth is 2

Meshes and Tori

2D mesh
  diameter: 2(√n - 1)
  bisection bandwidth: √n
Generalizes to 3D and higher dimensions
  Cray T3D/T3E uses a 3D torus
  often easy to implement algorithms that use 2D-3D arrays

Hypercubes

Number of nodes n = 2^d for dimension d
  diameter: d
  bisection bandwidth: n/2
Popular in early machines (Intel iPSC, nCUBE)
  lots of clever algorithms
Greycode addressing
  each node connected to d others with 1 bit different
[Figure: 3-dimensional hypercube with nodes labeled 000 through 111]

Trees

Diameter: log n
Bisection bandwidth: 1
Easy layout as planar graph
Many tree algorithms (summation)
Fat trees avoid the bisection bandwidth problem
  more (or wider) links near the top
  example: Thinking Machines CM-5
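A minimal C sketch of this addressing scheme (the helper names are ours, not from the slides): each node's d neighbors are found by flipping one address bit, and the binary-reflected Gray code maps a ring onto the cube so that consecutive positions are cube neighbors.

#include <stdio.h>

/* Hypercube addressing: node IDs are d-bit strings, and each node is
   connected to the d nodes whose IDs differ from its own in one bit. */
static unsigned neighbor(unsigned node, unsigned dim) {
    return node ^ (1u << dim);   /* flip bit 'dim' */
}

/* Binary-reflected Gray code: gray(i) and gray(i+1) differ in exactly
   one bit, so a ring (or 1D array) embeds directly into the cube. */
static unsigned gray(unsigned i) {
    return i ^ (i >> 1);
}

int main(void) {
    const unsigned d = 3;                 /* 3-cube: 8 nodes, 000..111 */
    for (unsigned dim = 0; dim < d; dim++)
        printf("neighbor of node 0 across dimension %u: %u\n",
               dim, neighbor(0, dim));
    for (unsigned i = 0; i < (1u << d); i++)
        printf("ring position %u -> cube node %u\n", i, gray(i));
    return 0;
}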

Butterflies

Diameter: log n
Bisection bandwidth: n
Cost: lots of wires
Used in the BBN Butterfly
Natural for FFT
[Figure: butterfly building block, a 2x2 switch]

Outline

Interconnection network issues:
  Topology characteristics
    average routing distance
    diameter (maximum routing distance)
    bisection bandwidth
  Link, switch design
  Switching
    packet switching vs. circuit switching
    store-&-forward vs. cut-through routing
  Routing

Link Design/Engineering Space

Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
  Narrow: control, data and timing multiplexed on wire
  Wide: control, data and timing on separate wires
  Short: single logical value at a time
  Long: stream of logical values at a time
  Synchronous: source & dest on same clock
  Asynchronous: source encodes clock in signal

Switches

[Figure: switch structure; input ports feed receivers and input buffers, a cross-bar connects input buffers to output buffers and transmitters on the output ports, and control logic performs routing and scheduling]

Switch Components

Input ports
  synchronizer aligns data signal with the local clock domain
  essentially a FIFO buffer
Crossbar
  connects each input to any output
  degree limited by area or pinout
Buffering
Control logic
  complexity depends on routing logic and scheduling algorithm
  determine the output port for each incoming packet
  arbitrate among inputs directed at the same output
Output ports
  transmitter (typically drives clock and data)

Switching Strategies

Circuit switching: full path reserved for entire message
  like the telephone
Packet switching: message broken into separately-routed packets
  like the post office
Question: what are the pros and cons of circuit switching & packet switching?
Store & forward vs. cut-through routing
[Figure: timing diagrams contrasting store-and-forward routing, where each switch receives the entire packet before forwarding it toward the destination, with cut-through routing, where the packet advances as soon as the header is decoded]
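The contrast in the figure can be captured with first-order estimates; the sketch below uses our own notation (n bytes, bandwidth bw, h hops, per-hop delay delta), not values from the slides.

#include <stdio.h>

/* First-order transfer-time estimates for an h-hop route: store-and-forward
   retransmits the whole n-byte packet at every switch, while cut-through
   forwards as soon as the small header has been examined, so the per-packet
   transmission time is paid only once. */
double store_and_forward(double n, double bw, int h, double delta) {
    return h * (n / bw + delta);          /* full packet time at every hop */
}
double cut_through(double n, double bw, int h, double delta) {
    return n / bw + h * delta;            /* pipelined across the hops */
}

int main(void) {
    double n = 1024, bw = 100e6, delta = 1e-7;  /* bytes, bytes/s, s per hop */
    for (int h = 1; h <= 4; h++)
        printf("h=%d  store-and-forward=%.2f us  cut-through=%.2f us\n", h,
               1e6 * store_and_forward(n, bw, h, delta),
               1e6 * cut_through(n, bw, h, delta));
    return 0;
}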

Outline

Interconnection network issues:
  Topology characteristics
    average routing distance
    diameter (maximum routing distance)
    bisection bandwidth
  Switching
    packet switching vs. circuit switching
    store-&-forward vs. cut-through routing
  Link, switch design
  Routing

Routing

The interconnection network provides multiple paths between a pair of source-destination nodes.
The routing algorithm determines
  which of the possible paths are used as routes
  how the route is determined
Question: what desirable properties should the routing algorithm have?

Routing Mechanism

Need to select the output port for each input packet, in a few cycles
Simple arithmetic in regular topologies
  ex: x,y routing in a grid
  encode the remaining distance to the destination in the header
    west (-x) if x < 0
    east (+x) if x > 0
    south (-y) if x = 0, y < 0
    north (+y) if x = 0, y > 0
    processor if x = 0, y = 0
Dimension-order routing in k-ary meshes
  reduce the relative address of each dimension in order

Routing Mechanism (cont)

Source-based
  message header carries a series of port selects, used and stripped en route
  variable sized packets: CRC? packet format?
  CS-2, Myrinet, MIT Arctic
Table-driven
  message header carries an index for the next port at the next switch: o = R[i]
  the table also gives the index for the following hop: o, i' = R[i]
  ATM, HIPPI
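A sketch of the port-select logic in C, assuming the header carries the remaining (dx, dy) offset as described above; the enum and function names are ours.

#include <stdio.h>

/* Dimension-order (x,y) routing decision from the table above: reduce the
   x offset first, then the y offset, then deliver locally. */
enum port { WEST, EAST, SOUTH, NORTH, PROCESSOR };

enum port select_port(int dx, int dy) {
    if (dx < 0) return WEST;     /* -x */
    if (dx > 0) return EAST;     /* +x */
    if (dy < 0) return SOUTH;    /* x done, -y */
    if (dy > 0) return NORTH;    /* x done, +y */
    return PROCESSOR;            /* arrived: deliver to local processor */
}

int main(void) {
    static const char *name[] = { "west", "east", "south", "north", "processor" };
    int dx = -2, dy = 1;         /* example offset carried in the header */
    for (;;) {
        enum port p = select_port(dx, dy);
        printf("(%d,%d) -> %s\n", dx, dy, name[p]);
        if (p == PROCESSOR) break;
        if (p == WEST) dx++; else if (p == EAST) dx--;
        else if (p == SOUTH) dy++; else dy--;   /* each hop reduces the offset */
    }
    return 0;
}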

Properties of Routing Algorithms

Deterministic
  route determined by (source, dest), not by intermediate state (i.e., traffic)
Adaptive
  route influenced by traffic along the way
Minimal
  only selects shortest paths
Deadlock free
  no traffic pattern can lead to a situation where no packet can move forward

Deadlocks

How can they arise?
  necessary conditions:
    shared resource
    incrementally allocated
    non-preemptible
  think of a link/channel as a shared resource that is acquired incrementally
    source buffer then dest. buffer
    channels along a route
How do you avoid them?
  constrain how channel resources are allocated
  Question: how do we avoid deadlocks in a 2D mesh?

Proof Technique

How do you prove that a routing algorithm is deadlock free?
Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate the possible dependences that can arise between channels
  show that there are no cycles in the Channel Dependence Graph
  find a numbering of channel resources such that every legal route follows a monotonic sequence
  => no traffic pattern can lead to deadlock

Example: 2D Array

Theorem: x,y routing is deadlock free
Numbering:
  +x channel (i,y) -> (i+1,y) gets number i
  -x channels are numbered in the reverse direction
  +y channel (x,j) -> (x,j+1) gets number N+j
  -y channels are numbered in the reverse direction
Any routing sequence (x direction, turn, y direction) is increasing
The network need not be acyclic, only the channel dependence graph
[Figure: 4x4 array of nodes 00 through 33, with +x channels numbered 1-3 and +y channels numbered 16-19]
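A small C sketch of the numbering argument (constants and the indexing convention are ours, chosen to echo the 4x4 figure): assign channel numbers by the rule above and check that a sample x-then-y route acquires them in strictly increasing order.

#include <stdio.h>
#include <stdlib.h>

#define NNODES 16   /* the 4x4 example: +y channels start at N = 16 */

/* Channel numbering from the slide: the +x channel into column i gets a
   small number i, and the +y channel into row j gets N+j. */
int plus_x_channel(int i) { return i; }
int plus_y_channel(int j) { return NNODES + j; }

int main(void) {
    /* An x,y route: one +x hop, then two +y hops.  The acquired channel
       numbers must be strictly increasing for the deadlock-freedom proof. */
    int channels[] = { plus_x_channel(1),      /* +x hop: channel 1  */
                       plus_y_channel(1),      /* +y hop: channel 17 */
                       plus_y_channel(2) };    /* +y hop: channel 18 */
    int n = sizeof channels / sizeof channels[0];
    for (int k = 0; k + 1 < n; k++)
        if (channels[k] >= channels[k + 1]) {
            printf("route is NOT monotonic -> numbering broken\n");
            return EXIT_FAILURE;
        }
    printf("channels %d, %d, %d acquired in increasing order: deadlock free\n",
           channels[0], channels[1], channels[2]);
    return EXIT_SUCCESS;
}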

Channel Dependence Graph

Consider a message traveling from node 11 to node 12, then to node 22, and finally to node 32.
It obtains channels numbered 2, then 18, then 19.
[Figure: channel dependence graph for the 4x4 array, with one node per numbered channel and an edge for each dependence a route can introduce]

Routing Deadlocks

If all turns are allowed, then channels are not obtained in increasing order
The channel dependence graph will have a cycle:
  edges between 2 -> 17, 17 -> 1, 1 -> 18, and 18 -> 2
Question: what happens with a torus (or wraparound connections)?
  basic dimension-order routing techniques don't work with wrap-around edges
  Idea: add channels!

Deadlock Free Wormhole Networks

How do we avoid deadlocks in such a situation?
Breaking deadlock with virtual channels
  provide multiple virtual channels to break the dependence cycle
  packets switch from the lo channel to the hi channel
  do not need to add links or crossbar capacity, only buffer resources
  good for bandwidth too!
  the previous scheme removed edges; this adds nodes to the CDG
[Figure: switch with input ports and output ports connected through a cross-bar, each physical channel split into multiple virtual channels]

Turn Restrictions in X,Y Routing

XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
Can you allow more turns and still be deadlock free?
Minimal turn restrictions in 2D:
  west-first
  north-last
  negative-first
[Figure: the eight possible turns among the +x, -x, +y, -y directions, with the turns forbidden by each scheme marked, and example legal west-first routes]

Adaptive Routing

R: C x N x Σ -> C
  the routing function now depends on network state Σ as well as the current channel and node
Essential for fault tolerance
  can route around failures or congestion
Can combine turn restrictions with virtual channels
Can improve utilization of the network
  simple deterministic algorithms easily run into bad permutations
Choices: fully/partially adaptive, minimal/non-minimal
  can introduce complexity or anomalies
  a little adaptation goes a long way!

Up*-Down* Routing

Given any bi-directional network
  construct a spanning tree
  number the nodes increasing from leaves to root
    just a topological sort of the spanning tree
  up edge: any edge going from a lower numbered node to a higher numbered node
  down edges are the opposite
Any source -> dest by an up*-down* route
  up edges, single turn, down edges
  not constrained to just using the spanning tree edges
Performance?
  some numberings and routes are much better than others
  interacts with topology in strange ways
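A small C check of the up*-down* rule above (a sketch with our own function names): a legal route may climb any number of up edges, turn at most once, and then only descend.

#include <stdio.h>

/* Given the node numbers along a path, verify the up*-down* property:
   once the route takes a down edge (toward a lower number), it may never
   take an up edge again. */
int is_up_down(const int *path, int len) {
    int turned = 0;
    for (int i = 0; i + 1 < len; i++) {
        int up = path[i] < path[i + 1];   /* up edge: toward a higher number */
        if (up && turned) return 0;       /* down then up again: illegal */
        if (!up) turned = 1;              /* first down edge is the turn */
    }
    return 1;
}

int main(void) {
    int legal[]   = { 2, 5, 9, 7, 3 };    /* up, up, down, down */
    int illegal[] = { 2, 5, 3, 8 };       /* up, down, up: forbidden */
    printf("legal route: %d, illegal route: %d\n",
           is_up_down(legal, 5), is_up_down(illegal, 4));
    return 0;
}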

Topology Summary

Topology        Degree     Diameter         Ave Dist      Bisection    D (D ave) @ P=1024
1D Array        2          N-1              N/3           1            huge
1D Ring         2          N/2              N/4           2            512 (256)
2D Mesh         4          2(N^1/2 - 1)     2/3 N^1/2     N^1/2        63 (21)
2D Torus        4          N^1/2            1/2 N^1/2     2 N^1/2      32 (16)
Hypercube       n = log N  n                n/2           N/2          10 (5)
Butterfly       4          log N            log N         N            10 (10)

k-ary n-cubes with n = 2 or n = 3
  short wires, easy to build
  many hops, low bisection bandwidth
k-ary n-cubes with n >= 4
  harder to build, more wires, longer average length
  fewer hops, better bisection bandwidth

Butterfly Network

Low diameter: O(log N)
Switches: 2 incoming links, 2 outgoing links
Processors: connected to the first and last levels
[Figure: butterfly network connecting inputs 000-111 to outputs 000-111 through log N levels of 2x2 switches]

Routing in Butterfly Network

Routes:
  single path from a source to a destination
  deterministic
  non-adaptive
  can run into congestion
Routing algorithm:
  correct bits one at a time
  consider: 001 -> 111
[Figure: butterfly network with the route from input 001 to output 111 highlighted]
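A C sketch of this destination-tag routing (the function names are ours): one address bit is corrected per level, most significant first, so the route from any input to a given output is unique.

#include <stdio.h>

/* Destination-tag routing in a log N-level butterfly: at level l the
   switch fixes one address bit of the destination, most significant bit
   first ("correct bits one at a time"). */
void route(unsigned src, unsigned dest, unsigned levels) {
    unsigned node = src;
    printf("%u", node);
    for (unsigned l = 0; l < levels; l++) {
        unsigned bit = 1u << (levels - 1 - l);     /* bit fixed at this level */
        node = (node & ~bit) | (dest & bit);       /* straight or cross edge */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    route(1, 7, 3);   /* the slides' example: 001 -> 111 gives 1 -> 5 -> 7 -> 7 */
    return 0;
}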

Congestion

Easy to have two routes share links
  consider: 001 -> 111 and 000 -> 011
[Figure: butterfly network with the two routes overlapping on a shared link]

Congestion: Worst Case Scenario

How bad can it get?
Bit reversal permutation: b1 b2 ... b2r-1 b2r -> b2r b2r-1 ... b2 b1
Consider a general butterfly with 2r = log N levels
Consider just the following source-dest pairs:
  source: low-order r bits are zero, i.e., of the form b1 b2 ... br 0 0 ... 0
  dest: 0 0 ... 0 br br-1 ... b1
  all of these must pass through 0 0 ... 0 after r routing steps
How many such pairs exist?
  every combination of b1 b2 ... br
  number of combinations: 2^r = sqrt(2^2r) = sqrt(N)
  so sqrt(N) packets all cross a single intermediate switch

Randomized Algorithm

Question:
  assume one packet from each source, and assume random destinations
  how many packets go through some intermediate switch at level k in the network (on average)?
  sources that could generate a message through the switch: 2^k
  number of possible destinations reachable through it: 2^(log N - k)
  expected congestion: 2^k * 2^(log N - k) / N = N / N = 1
How do we deal with bad permutations?
  turn them into two average-case behavior problems!
  to route from source to dest:
    route from the source to a random node
    route from the random node to the destination
  this turns the initial routing problem into two average-case permutations
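A sketch of the two-phase idea in C (the names and the example permutation are ours): each packet is first sent to a uniformly random intermediate node, then on to its true destination.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Valiant-style two-phase routing: one potentially bad permutation is
   split into two average-case routing problems via a random intermediate. */
typedef struct { unsigned intermediate, dest; } two_phase_route;

two_phase_route plan(unsigned dest, unsigned n_nodes) {
    two_phase_route r;
    r.intermediate = (unsigned)(rand() % n_nodes);  /* random first leg */
    r.dest = dest;
    return r;
}

int main(void) {
    srand((unsigned)time(NULL));
    unsigned n = 8;                                  /* 8 endpoints, as in the figures */
    for (unsigned src = 0; src < n; src++) {
        unsigned dest = (src + 1) % n;               /* some fixed permutation */
        two_phase_route r = plan(dest, n);
        printf("%u: route to %u, then on to %u\n", src, r.intermediate, r.dest);
    }
    return 0;
}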

Why Butterfly Networks?

Bad permutations exist for all interconnection networks
Many networks perform well when you have locality or in the average case

Relationship of Butterflies to Hypercubes

Wiring is isomorphic to a hypercube
  except that the butterfly always takes log n steps
Equivalent to hypercubes and fat-trees
[Figure: fat tree]

de Bruijn Network

Each node has two outgoing links
  node x is connected to nodes 2x mod N and 2x+1 mod N
Example:
  Node 000 is connected to Node 000 and Node 001
  Node 001 is connected to Node 010 and Node 011
How do we perform routing on such a network?
What is the diameter of this network?
[Figure: eight-node de Bruijn network on labels 000-111]
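One standard answer to the routing question, shown as a C sketch (ours, not from the slides): taking the edge x -> 2x+b (mod N) shifts the node label left and appends bit b, so shifting in the destination's d bits, high bit first, reaches it in at most d = log2 N hops, which also bounds the diameter by log2 N.

#include <stdio.h>

/* Shift-register routing on a de Bruijn network of N = 2^d nodes: each hop
   takes the edge x -> (2x + b) mod N, appending one destination bit b. */
void route(unsigned src, unsigned dest, unsigned d) {
    unsigned mask = (1u << d) - 1, node = src;
    printf("%u", node);
    for (int i = (int)d - 1; i >= 0; i--) {
        unsigned b = (dest >> i) & 1u;
        node = ((node << 1) | b) & mask;   /* take edge 2x+b mod N */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    route(1, 6, 3);   /* 001 to 110 on the 8-node network: 1 -> 3 -> 7 -> 6 */
    return 0;
}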

Summary

We covered:
  popular topologies
  routing issues
    cut-through/store-and-forward, packet-switching/circuit-switching
    deadlock-free routes: limit paths, introduce virtual channels
    some popular routing algorithms
  link/switch design issues
From the software perspective:
  all that matters is that the interconnection network takes a chunk of bytes and communicates it to the target processor
  it would be useful to abstract the interconnection network into some useful performance metrics

Latency and Bandwidth

How do you model and measure point-to-point communication performance?

Linear Model of Communication Cost

A simple linear model:
  data transfer time = latency + message_size / bandwidth
  latency is the startup time, independent of message size
  bandwidth is the number of bytes transferred per second
Mostly independent of source and destination!
  linear is often a good approximation; piecewise linear is sometimes better
  the latency/bandwidth model helps in understanding performance
For short messages, latency dominates the transfer time; for long messages, the bandwidth term dominates.
What are "short" and "long"?
  the latency term equals the bandwidth term when latency = message_size / bandwidth
  critical message size = latency * bandwidth
  example: 50 us * 50 MB/s = 2500 bytes
  messages longer than 2500 bytes are bandwidth dominated
  messages shorter than 2500 bytes are latency dominated
But the linear model is not enough:
  When can the next transfer be initiated?
  Can the cost be overlapped?
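The model in code, using the slides' example numbers (50 us latency, 50 MB/s bandwidth):

#include <stdio.h>

/* Linear communication cost model: a fixed startup latency plus a
   size/bandwidth term; the two balance at the critical message size
   latency * bandwidth. */
double transfer_time(double bytes, double latency_s, double bw_bytes_per_s) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double latency = 50e-6;       /* 50 us startup */
    double bw = 50e6;             /* 50 MB/s */
    printf("critical size = %.0f bytes\n", latency * bw);     /* 2500 */
    double sizes[] = { 100, 2500, 1e6 };
    for (int i = 0; i < 3; i++)
        printf("%8.0f bytes: %.1f us\n", sizes[i],
               1e6 * transfer_time(sizes[i], latency, bw));
    return 0;
}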

LogGP Model

L: latency in sending a (small) message between modules
o: overhead felt by the processor on sending or receiving a message
g: gap between successive sends or receives
G: gap between successive bytes of the same message
P: number of processors
Limited capacity: at most L/g messages in transit to/from any processor
[Figure: P processors attached to an interconnection network; each send costs overhead o, successive sends are separated by gap g, and the network adds latency L]

Using the Model

Time to send a large message of a given size:
  L + o + size * G
Time to send n small messages from one processor to another:
  L + o + (n-1)*g
  the processor has n*o cycles of overhead
  it has (n-1)*(g-o) idle cycles that could be overlapped with other computation
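The two formulas in code; the parameter values below are the Intel Paragon row of the table that follows.

#include <stdio.h>

/* LogGP cost estimates from the slide (parameters as defined above). */
typedef struct { double L, o, g, G; } loggp;   /* all in seconds */

/* Time for one large message of 'size' bytes: L + o + size*G. */
double t_large(loggp m, double size) { return m.L + m.o + size * m.G; }

/* Time for n back-to-back small messages: L + o + (n-1)*g; the sender is
   busy n*o and has (n-1)*(g-o) idle cycles it could overlap with work. */
double t_small(loggp m, int n) { return m.L + m.o + (n - 1) * m.g; }

int main(void) {
    loggp m = { 20.5e-6, 5.9e-6, 8.3e-6, 0.007e-6 };
    printf("1 MB message: %.1f us\n", 1e6 * t_large(m, 1 << 20));
    printf("100 small messages: %.1f us (overhead %.1f us, idle %.1f us)\n",
           1e6 * t_small(m, 100), 1e6 * 100 * m.o, 1e6 * 99 * (m.g - m.o));
    return 0;
}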

Some Typical LogGP Values

Machine          L          o          g          G
CM5              16.5 us    6.0 us     6.2 us     0.125 us (8 MB/s)
T3D              0.85 us    0.40 us    0.40 us    0.007 us (140 MB/s)
Intel Paragon    20.5 us    5.9 us     8.3 us     0.007 us (140 MB/s)

Message Passing Programs

Separate processes, separate address spaces
Processes execute independently and concurrently
Processes transfer data cooperatively
General version: Multiple Program Multiple Data (MPMD)
Slightly constrained version: Single Program Multiple Data (SPMD)
  single code image running on different processors
  can execute independently (or asynchronously), take different branches for instance
MPI: the most popular message passing library

Hello World (Trivial)

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );
    printf( "Hello, world!\n" );
    MPI_Finalize();
    return 0;
}

A simple, but not very interesting SPMD program
  to make a plain C hello world legal MPI, we only needed to add 2 lines: MPI_Init and MPI_Finalize

Hello World (Independent Processes)

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}
MPI Basic Send/Receive

An extended message-passing model
  not a language or compiler specification
  not a specific implementation or product
We need to fill in the details in:
  Process 0: Send(data)
  Process 1: Receive(data)
Things that need specifying:
  How will processes be identified?
  How will data be described?
  How will the receiver recognize/screen messages?
  What will it mean for these operations to complete?
Processes belong to communicators (process groups)
  the default communicator is MPI_COMM_WORLD
  communicators have a size and define a rank for each member

Point-to-Point Example

Process 0 sends array A to process 1, which receives it as B

Process 0:
  #define TAG 123
  double A[10];
  MPI_Send(A, 10, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);

Process 1:
  #define TAG 123
  double B[10];
  MPI_Recv(B, 10, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status);

or, with wildcards:
  MPI_Recv(B, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
status: useful for querying the tag and source after reception
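Assembled into a complete, runnable program (the rank test, array initialization, and printout are ours; run with at least 2 processes):

#include "mpi.h"
#include <stdio.h>
#define TAG 123

int main( int argc, char *argv[] )
{
    int rank;
    double A[10], B[10];
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if (rank == 0) {
        for (int i = 0; i < 10; i++) A[i] = i;     /* fill the array */
        MPI_Send( A, 10, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD );
    } else if (rank == 1) {
        MPI_Recv( B, 10, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status );
        printf( "received B[9] = %g from rank %d\n", B[9], status.MPI_SOURCE );
    }
    MPI_Finalize();
    return 0;
}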

MPI Datatypes

The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as:
  predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
  a contiguous array of MPI datatypes
  a strided block of datatypes
  an indexed array of blocks of datatypes
  an arbitrary structure of datatypes
Goal: support heterogeneous clusters
May improve performance:
  reduces memory-to-memory copies in the implementation
  allows the use of special hardware (scatter/gather) when available
[Figure: layout in memory of a strided block datatype]

Collective Communication in MPI

Collective operations are called by all processes in a communicator.
MPI_BCAST distributes data from one process to all others in a communicator:
  MPI_Bcast(start, count, datatype, source, comm);
MPI_REDUCE combines data from all processes in a communicator and returns it to one process:
  MPI_Reduce(in, out, count, datatype, operation, dest, comm);
For example:
  MPI_Reduce(&mysum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
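A complete runnable example combining the two collectives (the local partial-sum loop is ours): every rank receives n via MPI_Bcast, sums its share of 0..n-1, and MPI_Reduce delivers the total to rank 0.

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size, n;
    double mysum, sum;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    n = 100;                                        /* chosen at the root */
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD ); /* all ranks get n */
    mysum = 0.0;                                    /* sum a strided share */
    for (int i = rank; i < n; i += size) mysum += i;
    MPI_Reduce( &mysum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if (rank == 0) printf( "sum of 0..%d = %g\n", n - 1, sum );
    MPI_Finalize();
    return 0;
}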

Non-blocking Operations

Split communication operations into two parts:
  the first part initiates the operation; it does not block
  the second part waits for the operation to complete
MPI_Request request;
MPI_Send(buf, count, type, dest, tag, comm)
  = MPI_Isend(buf, count, type, dest, tag, comm, &request)
  + MPI_Wait(&request, &status)
MPI_Recv(buf, count, type, source, tag, comm, status)
  = MPI_Irecv(buf, count, type, source, tag, comm, &request)
  + MPI_Wait(&request, &status)

Using Non-blocking Operations

Two advantages:
  no deadlock (correctness)
  data may be transferred concurrently (performance)
Example:
  Process 0: Send(1); Recv(1)
  Process 1: Send(0); Recv(0)
  both sends may block and deadlock; with non-blocking operations, each process can do Isend(); compute; Wait() instead

Operations on MPI_Request

MPI_Wait(INOUT request, OUT status)
  waits for the operation to complete and returns info in status
  frees the request object (and sets it to MPI_REQUEST_NULL)
MPI_Test(INOUT request, OUT flag, OUT status)
  tests whether the operation is complete and returns info in status
  frees the request object if complete
MPI_Request_free(INOUT request)
  frees the request object but does not wait for the operation to complete
MPI_Waitall(..., INOUT array_of_requests, ...)
MPI_Testall(..., INOUT array_of_requests, ...)
MPI_Waitany / MPI_Testany / MPI_Waitsome / MPI_Testsome

Non-Blocking Communication Gotchas

Obvious caveats:
  1. You may not modify the buffer between Isend() and the corresponding Wait(). Results are undefined.
  2. You may not look at or modify the buffer between Irecv() and the corresponding Wait(). Results are undefined.
  3. You may not have two pending Irecv()s for the same buffer.
Less obvious:
  4. You may not look at the buffer between Isend() and the corresponding Wait().
  5. You may not have two pending Isend()s for the same buffer.
Why the Isend() restrictions?
  restrictions give implementations more freedom, e.g., on a heterogeneous computer with differing byte orders the implementation can swap bytes in the original buffer
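A complete runnable sketch of the Isend/Irecv discipline above (assembled by us; run with exactly 2 processes): post the receive, post the send, optionally compute, then wait on both requests before touching either buffer.

#include "mpi.h"
#include <stdio.h>
#define TAG 99

int main( int argc, char *argv[] )
{
    int rank, other, out, in;
    MPI_Request reqs[2];
    MPI_Status stats[2];
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    other = 1 - rank;                 /* partner rank in a 2-process run */
    out = rank;
    /* Post the receive first, then the send; neither call blocks. */
    MPI_Irecv( &in,  1, MPI_INT, other, TAG, MPI_COMM_WORLD, &reqs[0] );
    MPI_Isend( &out, 1, MPI_INT, other, TAG, MPI_COMM_WORLD, &reqs[1] );
    /* ... computation could overlap the transfer here; per the gotchas,
       do not touch 'in' or 'out' until the wait completes. */
    MPI_Waitall( 2, reqs, stats );
    printf( "rank %d got %d\n", rank, in );
    MPI_Finalize();
    return 0;
}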
