Distributed Memory Machines
Arvind Krishnamurthy
Fall 2004
[Diagram: processors P1, P2, ..., Pn, each with a local memory and a network interface (NI), connected by an interconnect.]
Historical Perspective
Network Analogy
- link = street
- switch = intersection
- distances (hops) = number of blocks traveled
- routing algorithm = travel plans
Important Properties / Network Characteristics
- Topology properties (e.g., bisection bandwidth)
- Routing and control
- Switching strategy
- Message format: a packet carries a header, the data payload, an error code, and a trailer
Linear array; Torus or Ring
2D Mesh:
- Diameter: 2√n (for a √n × √n mesh)
- Bisection bandwidth: √n

Hypercubes
- Diameter: d (the dimension, = log n)
- Bisection bandwidth: n/2
- Gray-code addressing: neighboring nodes differ in exactly one address bit
[Diagram: 3-cube with nodes labeled 000, 001, 011, 010, 100, 101, 111, 110]

Trees
- Diameter: log n
- Bisection bandwidth: 1
- Easy layout as planar graph
- Many tree algorithms (summation)
- Fat trees avoid the bisection bandwidth problem
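To make the Gray-code idea concrete, here is a minimal C sketch (not from the slides; the gray() helper is the standard binary-reflected Gray code): consecutive indices map to labels that differ in exactly one bit, which is what lets a ring or linear array be embedded in a hypercube.

#include <stdio.h>

/* Binary-reflected Gray code: gray(i) and gray(i+1) differ in exactly
 * one bit, so ring position i can live at hypercube node gray(i). */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    for (unsigned i = 0; i < 8; i++) {
        unsigned g = gray(i);
        printf("%u -> %u%u%u\n", i, (g >> 2) & 1, (g >> 1) & 1, g & 1);
    }
    return 0;
}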
Switches

Link design dimensions:
- Short: a single logical value at a time
- Long: a stream of logical values at a time
- Wide: control, data and timing on separate wires
- Synchronous: source & destination on the same clock
- Asynchronous: source encodes the clock in the signal

Switch Components
- Input ports: receiver plus input buffer
- Output ports: output buffer plus transmitter
- Cross-bar connecting input ports to output ports
- Control: routing and scheduling logic
Switching Strategies
- Circuit switching vs. packet switching. Question: what are the pros and cons of circuit switching & packet switching?
- Store & forward vs. cut-through routing: a store-and-forward switch buffers the entire packet before forwarding it, while a cut-through switch forwards as soon as the header has been routed; the choice affects how much buffering and control logic the switch needs.
[Timing diagram: a 4-flit packet (3 2 1 0) crossing several switches toward its destination. Under store & forward, each switch receives all four flits before sending any of them on; under cut-through, the flits pipeline through the switches, so total time grows with hops plus packet length rather than hops times packet length.]
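The difference is easy to see in a cost model. Below is a minimal C sketch (my own illustration, with made-up parameters: n bytes, h hops, link bandwidth b bytes/us, per-switch delay d us): store & forward pays the full packet transfer time at every hop, while cut-through pays it only once.

#include <stdio.h>

/* store & forward: each switch buffers the whole packet before forwarding */
double store_and_forward(double n, int h, double b, double d) {
    return h * (n / b + d);
}

/* cut-through: the packet pipelines behind its header */
double cut_through(double n, int h, double b, double d) {
    return n / b + h * d;
}

int main(void) {
    /* 1024-byte packet, 4 hops, 100 bytes/us links, 0.5 us routing delay */
    printf("store & forward: %.2f us\n", store_and_forward(1024, 4, 100, 0.5));
    printf("cut-through:     %.2f us\n", cut_through(1024, 4, 100, 0.5));
    return 0;
}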
Routing

Routing Mechanism: a switch must select an output port for each incoming packet in a few cycles. Common mechanisms:
- Arithmetic: a simple calculation in regular topologies (e.g., coordinate offsets in a 2D array)
- Source-based: the source lists the output port to use at every switch on the route
- Table-driven: the packet header carries an index into a routing table at each switch

Properties of routing algorithms:
- Deterministic: the route is a function only of source and destination
- Adaptive: the route can depend on traffic encountered along the way
- Minimal: only shortest paths are used

Deadlocks
Proof technique: deadlock requires these necessary conditions:
- a shared resource
- incrementally allocated
- non-preemptible
Think of a link/channel as a shared resource that is acquired incrementally: first the source buffer, then the dest. buffer, for successive channels along a route.

Deadlock free: find a numbering of channel resources such that every legal route follows a monotonic sequence; then no cycle of waiting packets can form.

Example: 2D array. [Diagram: 4x4 mesh with processors P0-P3, nodes 00-33, and channels numbered (16-23 shown) so that legal routes use strictly increasing channel numbers.]
Routing Deadlocks
[Diagram: four packets in a 4x4 mesh, each holding one channel of the cycle 17 -> 18 -> 19 -> 16 and requesting the next channel in the cycle; none can proceed, so the network deadlocks.]
Avoiding deadlock: constrain routes so that packets acquire channels from lo to hi channel number only. Dimension-order routing on a mesh (all hops in X, then all hops in Y) is one such ordering.
Turn-model routing restricts which turns among the directions +X, -X, +Y, -Y are permitted:
- West-first: take all westward (-X) hops first; never turn back west
- North-last: turn north (+Y) only as the final direction traveled
- Negative-first: route in the negative directions (-X, -Y) before any positive direction
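As a concrete illustration, here is a minimal sketch of dimension-order (X-then-Y) routing in C (the Node struct and trace printing are my own scaffolding, not from the slides). Because all X channels are used before any Y channel, routes follow the lo-to-hi channel ordering and cannot form a cycle.

#include <stdio.h>

typedef struct { int x, y; } Node;

/* dimension-order routing: resolve the X offset fully, then the Y offset */
void route_xy(Node cur, Node dst) {
    while (cur.x != dst.x) {
        cur.x += (dst.x > cur.x) ? 1 : -1;   /* hop in X */
        printf(" -> (%d,%d)", cur.x, cur.y);
    }
    while (cur.y != dst.y) {
        cur.y += (dst.y > cur.y) ? 1 : -1;   /* then hop in Y; never back to X */
        printf(" -> (%d,%d)", cur.x, cur.y);
    }
    printf("\n");
}

int main(void) {
    Node s = {0, 0}, d = {2, 3};
    printf("(0,0)");
    route_xy(s, d);
    return 0;
}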
Adaptive Routing
A routing function R: C × N × Σ → C maps (current channel, destination node, network state) to the next channel.
- Essential for fault tolerance
- Can improve utilization of the network
- Simple deterministic algorithms easily run into bad permutations
- Up*-down* routing: orient every link of a spanning tree toward the root; a legal route takes zero or more "up" hops followed by zero or more "down" hops, never up after down, which rules out cycles
Topology Summary

Topology     Degree      Diameter        Ave Dist       Bisection   D (D_ave) @ P=1024
1D Array     2           N-1             N/3            1           huge
1D Ring      2           N/2             N/4            2
2D Mesh      4           2(N^1/2 - 1)    (2/3) N^1/2    N^1/2       63 (21)
2D Torus     4           N^1/2           (1/2) N^1/2    2 N^1/2     32 (16)
Butterfly    4           log N           log N          N/2         10 (10)
Hypercube    n = log N   n               n/2            N/2         10 (5)

Performance? Which is better in practice: low-dimensional networks (n = 2 or n = 3) or high-dimensional ones (n >= 4)?
Butterfly Network
- Low diameter: O(log N)
- Switches: 2 incoming links, 2 outgoing links
- Processors on the first and last levels, labeled 000 through 111 for N = 8
- Routes: destination-tag routing; at level i the switch examines bit i of the destination address and takes the straight edge for a 0 and the cross edge for a 1, so every (source, destination) pair has a route
[Diagram: 8-node butterfly, three levels of 2x2 switches connecting 000-111 on the left to 000-111 on the right.]

Congestion
Consider a general butterfly with 2r = log N levels, and consider routing from source 000...0 111...1 to dest 111...1 000...0: the deterministic route must pass through node 000...0 000...0 after r levels. The same holds for every source of the form 000...0 y routing to destination y 000...0, so under this permutation 2^r = sqrt(N) packets are forced through a single node: congestion sqrt(N).
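Destination-tag routing is simple enough to sketch in a few lines of C (my own rendering; the trace printing is illustrative). Each level consumes one bit of the destination address, so every route takes exactly log N steps:

#include <stdio.h>

/* At level i, a 2x2 switch takes the straight edge if the corresponding
 * destination bit is 0 and the cross edge if it is 1. */
void butterfly_route(unsigned dest, int levels) {
    for (int i = levels - 1; i >= 0; i--) {
        int bit = (dest >> i) & 1;
        printf("level %d: dest bit %d -> %s edge\n",
               levels - i, bit, bit ? "cross" : "straight");
    }
}

int main(void) {
    butterfly_route(5, 3);   /* destination 101 in an 8-node butterfly */
    return 0;
}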
Randomized Algorithm
The standard randomized fix (Valiant's trick): route each packet first to a randomly chosen intermediate node, then from there to its real destination. Any fixed permutation then looks like random traffic in both phases, avoiding the sqrt(N) hot spots above with high probability.
Question: what congestion do we expect under this scheme?
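A minimal sketch of the two-phase idea (my own illustration, using bit-fixing routing on a hypercube rather than the butterfly, with rand() as a stand-in for a real random choice):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* bit-fixing: flip the differing address bits from high to low */
void bitfix(unsigned cur, unsigned to, int k) {
    for (int i = k - 1; i >= 0; i--)
        if (((cur ^ to) >> i) & 1) {
            cur ^= 1u << i;
            printf(" -> %u", cur);
        }
}

int main(void) {
    int k = 3;                               /* 2^3 = 8 nodes */
    unsigned src = 0, dst = 7;
    srand((unsigned)time(NULL));
    unsigned mid = (unsigned)(rand() % 8);   /* phase 1: random intermediate */
    printf("phase 1: %u", src); bitfix(src, mid, k);
    printf("\nphase 2: %u", mid); bitfix(mid, dst, k);
    printf("\n");
    return 0;
}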
Fat Tree
The wiring of a fat tree is isomorphic to that of a butterfly, except that the butterfly always takes log n steps, while a fat tree can route between nearby leaves without going all the way up.
de Bruijn Network
Node b1 b2 ... bk is connected to nodes b2 ... bk 0 and b2 ... bk 1, i.e., shift the address left and append a bit. For example, Node 000 is connected to Node 000 and Node 001, and Node 001 is connected to Node 010 and Node 011.
How do we perform routing on such a network?
What is the diameter of this network?
[Diagram: 8-node de Bruijn graph on addresses 000-111.]
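The shift-and-append structure suggests answers to both questions; here is a minimal C sketch (my own illustration, not from the slides). To route, shift in the destination's bits one at a time, so every route takes at most k hops and the diameter is k = log N:

#include <stdio.h>

/* follow one de Bruijn edge: shift left, append a bit, keep k bits */
unsigned next_node(unsigned node, int k, int bit) {
    return ((node << 1) | (unsigned)bit) & ((1u << k) - 1);
}

void print_bits(unsigned v, int k) {
    for (int i = k - 1; i >= 0; i--) putchar('0' + (int)((v >> i) & 1));
}

int main(void) {
    int k = 3;
    unsigned cur = 0, dst = 3;               /* route 000 -> 011 */
    printf("000");
    for (int i = k - 1; i >= 0; i--) {       /* feed in dst's bits, MSB first */
        cur = next_node(cur, k, (int)((dst >> i) & 1));
        printf(" -> "); print_bits(cur, k);
    }
    printf("\n");
    return 0;
}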
Summary
We covered:
- Popular topologies
- Routing issues
- Cut-through/store-and-forward/packet-switching/circuit-switching
- Deadlock-free routes: limit paths, introduce virtual channels
All that matters is that the interconnection network takes a chunk of bytes and communicates it to the target processor. It would therefore be useful to abstract the interconnection network into a few performance metrics.
LogGP Model
- L (latency): time for a small message to cross the interconnection network
- o (overhead): time a processor spends sending or receiving a message
- g (gap): minimum time between consecutive message transmissions (1/g = per-processor small-message bandwidth)
- G (gap per byte): time per byte of a long message (1/G = large-message bandwidth)
- P: number of processors
Limited volume: at most L/g messages can be in flight to or from any one processor.
[Diagram: two processors attached to the interconnection network; a message pays o at the sender, L in flight, and o at the receiver, with successive messages g apart.]
Measured LogGP parameters:

Machine          L          o          g          G
CM5              20.5 us    5.9 us     8.3 us     0.007 us (140 MB/s)
T3D              16.5 us    6.0 us     6.2 us     0.125 us (8 MB/s)
Intel Paragon    0.85 us    0.40 us    0.40 us    0.007 us (140 MB/s)
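The point of the abstraction is that communication cost becomes a small formula. A back-of-the-envelope sketch in C, using the CM5 row above (the formulas are the usual LogGP estimates, in my own rendering):

#include <stdio.h>

/* CM5 parameters from the table above, in microseconds */
static const double L = 20.5, o = 5.9, g = 8.3, G = 0.007;

/* one n-byte message: send overhead, G per extra byte, latency, recv overhead */
double long_msg(double n) { return o + (n - 1) * G + L + o; }

/* k back-to-back small messages: consecutive sends are g apart */
double k_small_msgs(int k) { return o + (k - 1) * g + L + o; }

int main(void) {
    printf("one 1 KB message:  %.1f us\n", long_msg(1024));
    printf("10 small messages: %.1f us\n", k_small_msgs(10));
    return 0;
}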
A first MPI program: each process reports its rank and the total number of processes.

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "I am %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
An even simpler variant: every process just prints a greeting.

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
MPI_Init( &argc, &argv );
printf( "Hello, world!\n" );
MPI_Finalize();
return 0;
}
Point-to-Point Example

Process 0: Send(data)        Process 1: Receive(data)

MPI_Recv(B, 10, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &status)
or, with wildcards:
MPI_Recv(B, 10, MPI_DOUBLE, MPI_ANY_SOURCE,
         MPI_ANY_TAG, MPI_COMM_WORLD, &status)

Wildcards: MPI_ANY_SOURCE matches a message from any sender; MPI_ANY_TAG matches any tag.
status: useful for querying the tag and source after reception.
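Putting the pieces together, a complete sketch of this example: rank 0 sends ten doubles to rank 1, which receives with wildcards and then inspects status. The buffer names follow the slide; the tag value 99 is my own choice. Run with at least two processes.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double A[10] = {0}, B[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(A, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(B, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* status lets us recover who sent the message and with what tag */
        printf("received tag %d from rank %d\n",
               status.MPI_TAG, status.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}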
MPI DataTypes
An MPI datatype describes the layout in memory of the data being sent or received; besides the basic types (MPI_INT, MPI_DOUBLE, ...), derived datatypes can describe non-contiguous layouts.
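For example, a strided layout can be described with MPI_Type_vector. A minimal sketch (not from the slides): sending one column of a 4x4 row-major matrix without first copying it into a contiguous buffer.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, i, j;
    double m[4][4];
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            m[i][j] = (rank == 0) ? i * 4 + j : 0.0;

    /* 4 blocks of 1 double, with a stride of 4 doubles between blocks */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&m[0][1], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&m[0][1], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("column 1: %g %g %g %g\n", m[0][1], m[1][1], m[2][1], m[3][1]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}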
Non-blocking Operations
Split communication operations into two parts: a start call (MPI_Isend / MPI_Irecv) that returns immediately, and a completion call on the resulting request. Two advantages:
- No deadlock (correctness). With blocking calls, the symmetric exchange below can deadlock if both processes block in Send:

  Process 0: Send(1); Recv(1)        Process 1: Send(0); Recv(0)

- Overlap of communication with computation (performance):

  Process 0: Isend(1); compute; Wait()

Operations on MPI_Request:
- MPI_Wait(INOUT request, OUT status): blocks until the operation completes
- MPI_Request_free(INOUT request): frees the request object but does not wait for the operation to complete

Obvious caveats:
1. You may not modify the buffer between Isend() and the corresponding Wait(). Results are undefined.
2. You may not look at or modify the buffer between Irecv() and the corresponding Wait(). Results are undefined.
3. You may not have two pending Irecv()s for the same buffer.
Less obvious:
4. You may not look at the buffer between Isend() and the corresponding Wait().
5. You may not have two pending Isend()s for the same buffer.
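A complete sketch of the deadlock-free exchange: both ranks post Isend and Irecv, could compute in between, then wait on both requests. The buffers and tag are my own choices; run with exactly two processes.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, other, out, in = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                /* partner rank, assuming 2 processes */
    out = rank;

    MPI_Isend(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... compute here, without touching out or in (see caveats above) ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d\n", rank, in);
    MPI_Finalize();
    return 0;
}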