Programming Techniques for Supercomputers:
Parallel Computers (2*)
Distributed-memory computers / Hybrid systems
Communication Networks
Prof. Dr. G. Wellein(a,b), Dr. G. Hager(a), M. Wittmann(a)
(a) HPC Services, Regionales Rechenzentrum Erlangen
(b) Department für Informatik, University Erlangen-Nürnberg
Sommersemester 2016
* see lecture 7 for first part
Parallel Computers (2) - Introduction
Classification according to address space organization
Shared-Memory Architectures:
Cache-Coherent Single Address Space
Distributed-Memory Architectures
No Cache-Coherent Single Address Space
Hybrid architectures containing both concepts are state of the art
Distributed-memory computers & hybrid systems
Parallel distributed-memory computers: Basics
Distributed-memory parallel computer:
Each processor (P) is connected to exclusive local memory (M) and a network interface (NI); together they form a node
A (dedicated) communication network
connects all nodes
Data exchange between nodes: Passing
messages via network (Message Passing)
Variants:
No global (shared) address space: No Remote Memory Access (NORMA)
Non-coherent shared address space (NUMA), e.g. CRAY: PGAS languages (CoArray Fortran, UPC)
Prototype of first PC clusters: Node: single CPU PC; Network: Ethernet
First Massively Parallel Processing (MPP) architectures: CRAY T3D/E, Intel Paragon
Parallel distributed-memory computers: Hybrid system
Standard concept of most modern large parallel computers: Hybrid/hierarchical
Compute node is a 2- or 4-socket shared-memory system with a network interface (NI)
Communication network (GBit Ethernet, InfiniBand) connects the nodes
Price/(peak) performance is optimal; network capability per (peak) performance gets worse
Parallel programming? Pure message passing is standard. Hybrid programming? (see the sketch below)
Today: GPUs / accelerators are added to the nodes, further increasing complexity
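To illustrate what hybrid programming means in practice, here is a minimal MPI+OpenMP sketch. It is an illustration only (not part of the lecture material) and assumes an MPI library with thread support and an OpenMP-capable Fortran compiler:

! Minimal hybrid MPI+OpenMP sketch (illustration only):
! one MPI process per node/socket, several OpenMP threads inside each process
program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, provided, tid

  ! request thread support so that the master thread may call MPI
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

!$omp parallel private(tid)
  tid = omp_get_thread_num()
  print *, 'MPI rank', rank, ' OpenMP thread', tid
!$omp end parallel

  call MPI_Finalize(ierr)
end program hybrid_hello

Message passing is then used between nodes and threading within each shared-memory node; process and thread placement (pinning) determines how well this maps onto the hybrid hardware.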
Networks
What are the basic ideas and performance characteristics of modern networks?
Networks: Basic performance characteristics
Evaluate the network capabilities to transfer data
Use the same idea as for main memory access:
Total transfer time for a message of N Bytes is:
T = T_L + N/B
T_L is the latency (transfer setup time [sec]) and B is the asymptotic (N → ∞) network bandwidth [MBytes/sec]
Consider the simplest case (Ping-Pong):
Two processors in different nodes communicate via the network (point-to-point)
A single message of N Bytes is sent forward and backward
Overall data transfer is 2N Bytes! (see the model sketch below)
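As a worked illustration of the model, the following sketch evaluates T(N) and the resulting effective bandwidth. The latency and bandwidth values are assumed, GigE-like placeholders, not measured data:

! Evaluate the simple latency/bandwidth model:
!   T(N)    = T_L + N/B          (one-way transfer time)
!   Beff(N) = N / (T_L + N/B)    (effective bandwidth seen by Ping-Pong)
program latency_bandwidth_model
  implicit none
  double precision, parameter :: TL = 50.d-6    ! assumed latency [s]
  double precision, parameter :: B  = 111.d6    ! assumed asymptotic bandwidth [bytes/s]
  double precision :: N, T, Beff
  integer :: i

  do i = 0, 20                     ! message sizes from 1 byte to 1 MiB
     N    = 2.d0**i
     T    = TL + N/B
     Beff = N / T
     print *, 'N =', N, '  T [s] =', T, '  Beff [MBytes/s] =', Beff/1.d6
  end do
end program latency_bandwidth_model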
Networks: Basic performance characteristics
Ping-Pong benchmark (schematic view)
myID = get_process_ID()
if(myID.eq.0) then
  targetID = 1
  S = get_walltime()
  call Send_message(buffer,N,targetID)
  call Receive_message(buffer,N,targetID)
  E = get_walltime()
  MBYTES = 2*N/(E-S)/1.d6   ! MBytes/sec rate
  TIME = (E-S)/2*1.d6       ! transfer time in microsecs for single message
else
  targetID = 0
  call Receive_message(buffer,N,targetID)
  call Send_message(buffer,N,targetID)
endif
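In practice the Send_message/Receive_message calls map onto a message-passing library such as MPI. The following is a minimal, self-contained sketch of the same scheme (an illustration with an assumed fixed message size NBYTES; error handling and the loop over message sizes are omitted); it is not the benchmark code used for the measurements on the following slides:

! Minimal MPI Ping-Pong sketch (run with at least 2 processes)
program pingpong
  use mpi
  implicit none
  integer, parameter :: NBYTES = 1048576          ! assumed message size: 1 MiB
  character(len=1)   :: buffer(NBYTES)
  integer            :: myID, targetID, ierr
  integer            :: status(MPI_STATUS_SIZE)
  double precision   :: S, E

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myID, ierr)

  if (myID == 0) then
     targetID = 1
     S = MPI_Wtime()
     call MPI_Send(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, ierr)
     call MPI_Recv(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, status, ierr)
     E = MPI_Wtime()
     print *, 'Beff [MBytes/s]      :', 2.d0*NBYTES/(E-S)/1.d6
     print *, 'T per message [usec] :', (E-S)/2*1.d6
  else if (myID == 1) then
     targetID = 0
     call MPI_Recv(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, status, ierr)
     call MPI_Send(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program pingpong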
Networks: Basic performance characteristics
Ping-Pong benchmark for GBit-Ethernet (GigE) network
N1/2: message size where 50% of the peak bandwidth is achieved
Beff = 2*N/(E-S)/1.d6
Asymptotic bandwidth: B = 111 MBytes/sec ≈ 0.888 Gbit/s
Latency (N → 0): only qualitative agreement: 44 µs vs. 76 µs
Networks: Basic performance characteristics
Ping-Pong benchmark for DDR InfiniBand (DDR-IB) network
Determine B and T_L independently and combine them
Networks: Basic performance characteristics
First-principles modeling of Beff(N) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small N), may fail because:
Overhead for transmission protocols, e.g. message headers
Minimum frame size for message transmission, e.g. TCP/IP over Ethernet always transfers frames of a minimum size
Message setup/initialization involves multiple software layers and protocols; each software layer adds to the latency; hardware-only latency is often small
As the message size increases, the software may switch to a different protocol, e.g. from eager to rendezvous
Typical message sizes in applications are neither small nor large
The N1/2 value is also important: N1/2 = B * T_L (see the sketch below)
Network balance: relate the network bandwidth (B or Beff(N1/2)) to the compute power (or main memory bandwidth) of the nodes
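A small worked sketch of these two quantities; all parameter values are assumed, roughly IB-like placeholders, not taken from the lecture:

! Illustrative sketch (assumed parameters): N1/2 = B * T_L and a simple
! network balance estimate; values are placeholders, not measurements.
program n_half_and_balance
  implicit none
  double precision, parameter :: TL   = 5.d-6    ! assumed latency [s]
  double precision, parameter :: B    = 2.0d9    ! assumed network bandwidth [bytes/s]
  double precision, parameter :: PEAK = 500.d9   ! assumed node peak performance [flop/s]
  double precision :: Nhalf, balance

  Nhalf   = B * TL            ! message size delivering 50% of peak bandwidth
  balance = B / PEAK          ! bytes transferable per floating-point operation

  print *, 'N1/2 [bytes]               :', Nhalf
  print *, 'Beff(N1/2) [bytes/s]       :', Nhalf / (TL + Nhalf/B)   ! = B/2
  print *, 'Network balance [byte/flop]:', balance
end program n_half_and_balance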
Networks: Topologies & Bisection bandwidth
Network bisection bandwidth Bb is a general metric for the data
transfer capability of a system:
Minimum sum of the bandwidths of all connections cut when
splitting the system into two equal parts
More meaningful metric in terms of
system scalability:
Bisection BW per node: Bb/Nnodes
Bisection BW depends on
Bandwidth per link
Network topology
Uni- or Bi-directional bandwidth?!
Network topologies: Bus
Bus can be used by one
connection at a time
Bandwidth is shared among
all devices
Bisection BW is constant: Bb/Nnodes ~ 1/Nnodes
Collision detection, bus arbitration protocols must be in place
Examples: PCI bus, diagnostic buses
Advantages
Low latency
Easy to implement
Disadvantages
Shared bandwidth, not scalable
Problems with failure resiliency (one defective agent may block bus)
Fast buses for large N require large signal power
Non-blocking crossbar
A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements
This can be used as a 4-port non-blocking switch (fold at the secondary diagonal)
It is built from 2x2 switching elements
Switches can be cascaded to form
hierarchies (common case)
Allows scalable communication at high hardware/energy costs
Crossbars can be used as interconnects in computer systems, e.g. the NEC SX9 vector system (IXS)
Network topologies: Switches and Fat-Trees
Standard clusters are built with switched networks
Compute nodes (devices) are split up in groups; each group is connected to a single (non-blocking crossbar) switch (leaf switch)
Leaf switches are connected with each other using an additional
switch hierarchy (spine switches) or directly (for small configs.)
Switched networks: Distance between any two devices is
heterogeneous (number of hops in switch hierarchy)
Diameter of network: The maximum number of hops required to connect two
arbitrary devices, e.g. diameter of bus=1
Perfect world: Fully non-blocking, i.e. any choice of Nnodes/2
disjoint node (device) pairs can communicate at full speed
Fat tree switch hierarchies
Fully non-blocking:
Nnodes/2 end-to-end connections with full bandwidth B
Bb = B * Nnodes/2
Bb/Nnodes = const. = B/2
Sounds good, but see next slide
Oversubscribed (figure: spine switch with oversubscription factor k=3):
Spine does not support Nnodes/2 full-bandwidth end-to-end connections
Bb/Nnodes = const. = B/(2k), with k the oversubscription factor
Resource management (job placement) is crucial (see the sketch below)
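A short sketch of the two formulas above; the link bandwidth, node count, and oversubscription factor are assumed placeholder values:

! Illustrative sketch (assumed link bandwidth and node count):
! bisection bandwidth of a fat tree, fully non-blocking vs. oversubscribed.
program fat_tree_bisection
  implicit none
  double precision, parameter :: B = 2.5d9      ! assumed bandwidth per link [bytes/s]
  integer, parameter          :: NNODES = 512   ! assumed number of nodes
  integer, parameter          :: K = 3          ! assumed oversubscription factor
  double precision :: Bb_full, Bb_over

  Bb_full = B * NNODES / 2            ! fully non-blocking: Bb = B*Nnodes/2
  Bb_over = Bb_full / K               ! oversubscribed spine: Bb = B*Nnodes/(2k)

  print *, 'Bb/node, non-blocking   [bytes/s]:', Bb_full / NNODES   ! = B/2
  print *, 'Bb/node, oversubscribed [bytes/s]:', Bb_over / NNODES   ! = B/(2k)
end program fat_tree_bisection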
Fat trees and static routing
If all end-to-end data paths are preconfigured (static routing),
not all possible combinations of N agents will get full bandwidth
Example: 1→5, 2→6, 3→7, 4→8 is a collision-free pattern here
Change 2→6, 3→7 to 2→7, 3→6: this pattern has collisions if no other connections are re-routed at the same time (see the sketch below)
Static routing: quasi-standard in commodity interconnects
However, things are starting to improve slowly
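A toy sketch of why static routing causes such collisions, assuming a two-spine fat tree (nodes 1-2 and 3-4 on two leaf switches, destinations 5-8 on two others) and a simple destination-based uplink choice uplink = mod(destination, 2). This is an illustration of the idea only, not the routing algorithm of any real fabric:

! Toy sketch of destination-based static routing in a 2-spine fat tree.
program static_routing_collisions
  implicit none
  integer, parameter :: NCONN = 4
  integer, dimension(NCONN) :: src = (/ 1, 2, 3, 4 /)
  integer, dimension(NCONN) :: dst = (/ 5, 7, 6, 8 /)   ! the colliding pattern from the slide
  integer :: i, j, leaf_i, leaf_j

  do i = 1, NCONN-1
     do j = i+1, NCONN
        leaf_i = (src(i)-1)/2          ! leaf switch of source i (nodes 1-2 -> 0, 3-4 -> 1)
        leaf_j = (src(j)-1)/2
        ! collision: same source leaf switch and same statically chosen uplink (spine)
        if (leaf_i == leaf_j .and. mod(dst(i),2) == mod(dst(j),2)) then
           print *, 'collision:', src(i), '->', dst(i), ' and ', src(j), '->', dst(j)
        end if
     end do
  end do
end program static_routing_collisions

With dst = (/ 5, 6, 7, 8 /) the same check reports no collisions, matching the collision-free pattern above.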
Full fat-tree: Single 288-port IB DDR-Switch
Basic building blocks: 24-port switches
SPINE switch level: 12 switches
LEAF switch level: 24 switches with 24*12 = 288 ports to devices
Each leaf switch uses 12 of its 24 ports for devices and 12 uplinks (one to each spine switch)
In total S = 12 + 24 = 36 switches provide 288 ports
Fat tree networks: Examples
Ethernet
1, 10, and 100 Gbit/s variants
InfiniBand
Dominant high-performance commodity interconnect
DDR: 20 Gbit/s per link and direction (Building blocks: 24-port switches)
QDR: 40 Gbit/s per link and direction
QDR IB is used in RRZE's LiMa and Emmy clusters
Building blocks: 36-port switches; large 36*18 = 648-port switches
FDR-10 / FDR: 40/56 Gbit/s per link and direction
EDR: 100 Gbit/s per link and direction
Intel OmniPath
Up to 100 Gbit/s per link & 48-port baseline switches
Will be used in RRZE's next-generation cluster
Expensive & complex to scale to very high node counts
Meshes
Fat trees can become prohibitively expensive in large systems
Compromise: Meshes
n-dimensional Hypercubes
Toruses (2D / 3D)
Many others (including hybrids)
Each node is a router
Example: 2D torus mesh
Direct connections only between direct neighbors
This is not a non-blocking crossbar!
Intelligent resource management and
routing algorithms are essential
Toruses in very large systems: Cray XE/XK series, IBM Blue Gene
Bb ~ Nnodes^((d-1)/d), i.e. Bb/Nnodes → 0 for large Nnodes (see the sketch below)
Sounds bad, but those machines show good scaling for many codes
Well-defined and predictable bandwidth behavior!
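A short sketch of how Bb/Nnodes decays for a torus; the link bandwidth and sizes are assumed placeholder values, and the factor 2 accounts for the wrap-around links crossing the cut:

! Illustrative sketch (assumed link bandwidth): bisection bandwidth of a
! d-dimensional torus with n nodes per dimension, Nnodes = n**d.
! Cutting the torus in half crosses 2*n**(d-1) links (factor 2: wrap-around).
program torus_bisection
  implicit none
  double precision, parameter :: B = 5.d9   ! assumed bandwidth per link [bytes/s]
  integer, parameter :: D = 3               ! torus dimensionality
  integer :: n, nnodes
  double precision :: Bb

  do n = 4, 32, 4
     nnodes = n**D
     Bb     = 2.d0 * B * dble(n)**(D-1)     ! Bb ~ Nnodes**((d-1)/d)
     print *, 'Nnodes =', nnodes, '  Bb/node [bytes/s] =', Bb/nnodes   ! ~ 2B/n -> 0
  end do
end program torus_bisection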