Programming Techniques For Supercomputers
Distributed-Memory Architectures
No cache-coherent single address space: each node has its own local memory
Variants:
[Figure: distributed-memory parallel computer. Node: single CPU PC; Network: Ethernet]
Networks
T = T_L + N/B

T_L is the latency (transfer setup time [s]) and B is the asymptotic (N → ∞) network bandwidth [MBytes/s]
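A minimal sketch of this model in C (the latency and bandwidth values are assumptions for illustration, not measured numbers); it shows that the effective bandwidth N/T approaches B only for large messages:

#include <stdio.h>

/* Latency/bandwidth model: T(N) = T_L + N/B */
static double transfer_time(double N, double T_L, double B) {
    return T_L + N / B;
}

int main(void) {
    const double T_L = 44e-6;   /* latency [s] -- assumed value */
    const double B   = 111e6;   /* asymptotic bandwidth [bytes/s] -- assumed */
    for (double N = 1e2; N <= 1e8; N *= 100.0) {
        double T = transfer_time(N, T_L, B);
        /* effective bandwidth [MBytes/s]; approaches B for large N */
        printf("N = %.0e bytes: N/T = %.1f MBytes/s\n", N, N / T / 1e6);
    }
    return 0;
}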
Effective bandwidth from a PingPong benchmark (a message of N bytes is sent back and forth; S and E are the start and end timestamps in seconds):

Beff = 2*N/(E-S)/1.d6   [MBytes/s]
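A minimal MPI PingPong sketch in C that measures Beff this way (assumptions: two ranks, message size N and repetition count NITER are illustrative; this is not the course's exact benchmark code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;   /* message size [bytes] -- assumed */
    const int NITER = 100;   /* repetitions to beat timer resolution */
    char *buf = malloc(N);

    MPI_Barrier(MPI_COMM_WORLD);
    double S = MPI_Wtime();  /* start timestamp */
    for (int i = 0; i < NITER; ++i) {
        if (rank == 0) {
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double E = MPI_Wtime();  /* end timestamp */

    if (rank == 0) {
        /* 2*N bytes cross the wire per iteration (there and back) */
        double Beff = 2.0 * N * NITER / (E - S) / 1.0e6;
        printf("Beff = %.1f MBytes/s\n", Beff);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}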
Asymptotic bandwidth: B = 111 MBytes/s = 0.888 GBit/s
Latency (N → 0): only qualitative agreement, 44 µs vs. 76 µs
Disadvantages of bus networks
Shared bandwidth: not scalable
Problems with failure resiliency (one defective agent may block the bus)
Fast buses for large N require high signal power
Non-blocking crossbar
A non-blocking crossbar can mediate a number of simultaneous connections between a group of input and a group of output elements
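A toy sketch (mine, not from the slides) of the non-blocking property: any set of connection requests that pairs each busy input with a distinct output (a partial permutation) can be routed simultaneously:

#include <stdio.h>
#include <stdbool.h>

#define N 4
/* out[i] = output port requested by input i, or -1 if idle.
 * A non-blocking crossbar can realize the mapping whenever
 * no output port is requested by two inputs at once. */
static bool crossbar_can_route(const int out[N]) {
    bool taken[N] = { false };
    for (int i = 0; i < N; ++i) {
        if (out[i] < 0) continue;         /* idle input */
        if (taken[out[i]]) return false;  /* output conflict */
        taken[out[i]] = true;
    }
    return true;
}

int main(void) {
    int req[N] = { 2, 0, 3, 1 };  /* a full permutation: always routable */
    printf("routable: %d\n", crossbar_can_route(req));
    return 0;
}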
Oversubscribed fat tree: the spine level does not support the full Nnodes/2 simultaneous connections
Resource management (job placement) is crucial
[Figure: oversubscribed fat tree with spine switches, leaf switches, and nodes]
Static routing: quasi-standard in commodity interconnects
Basic building blocks: 24-port switches

A 288-port fat tree with S = 12 + 24 = 36 switches:
Leaf level: 24 switches with 24*12 = 288 ports to devices (12 ports down and 12 up per leaf)
Spine level: 12 switches
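A sketch of the general sizing rule (my generalization, not from the slides): a fully non-blocking two-level fat tree built from p-port switches has p leaf switches, p/2 spine switches, and p*p/2 device ports:

#include <stdio.h>

/* Size of a fully non-blocking two-level fat tree built from
 * p-port switches (p even); generalizes the 24-port example above. */
int main(void) {
    const int p = 24;              /* ports per switch */
    int leaves = p;                /* leaf switches, p/2 device ports each */
    int spines = p / 2;            /* spine switches */
    int device_ports = leaves * p / 2;
    printf("leaves=%d spines=%d total switches=%d device ports=%d\n",
           leaves, spines, leaves + spines, device_ports);
    /* p = 24: leaves=24, spines=12, total=36, device ports=288 */
    return 0;
}

The same rule with p = 36 gives 36*18 = 648 ports, matching the large InfiniBand switches mentioned below.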
InfiniBand
Dominant high-performance commodity interconnect
DDR: 20 Gbit/s per link and direction (building blocks: 24-port switches)
QDR: 40 Gbit/s per link and direction
QDR IB is used in RRZE's LiMa and Emmy clusters
Building blocks: 36-port switches; large switches with 36*18 = 648 ports
FDR-10 / FDR: 40/56 Gbit/s per link and direction
EDR: 100 Gbit/s per link and direction
Intel OmniPath
Up to 100 Gbit/s per link & 48-port baseline switches
Will be used in RRZE's next-generation cluster
Meshes
Fat trees can become prohibitively expensive in large systems
Compromise: Meshes
n-dimensional Hypercubes
Toruses (2D / 3D)
Many others (including hybrids)
Example: 2D torus mesh
Toruses in very large systems: Cray XE/XK series, IBM Blue Gene
Bisection bandwidth: Bb ~ Nnodes^((d-1)/d), so Bb/Nnodes → 0 for large Nnodes (worked example below)
Sounds bad, but those machines show good scaling for many codes
Well-defined and predictable bandwidth behavior!
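Worked instance of the scaling (numbers are illustrative, not from the slides): for a 3D torus (d = 3), Bb ~ Nnodes^(2/3). With Nnodes = 1000, the bisection bandwidth is ~1000^(2/3) = 100 link bandwidths, i.e., a per-node share of Bb/Nnodes = 0.1; at Nnodes = 10^6 it drops to 10^4/10^6 = 0.01.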