Programming Techniques for Supercomputers:
Parallel Computers (2*)
Distributed-memory computers / Hybrid systems
Communication Networks
Prof. Dr. G. Wellein(a,b), Dr. G. Hager(a), M. Wittmann(a)
(a) HPC Services, Regionales Rechenzentrum Erlangen
(b) Department für Informatik, University Erlangen-Nürnberg
Sommersemester 2016
* see lecture 7 for first part
Parallel Computers (2) - Introduction
Classification according to address space organization
Shared-Memory Architectures:
Cache-Coherent Single Address Space
Distributed-Memory Architectures
No Cache-Coherent Single Address Space
Hybrid architectures containing both concepts are state of the art
Distributed-memory computers & hybrid systems
Parallel distributed-memory computers: Basics
Distributed-memory parallel computer:
Each processor (P) is connected to exclusive local memory (M) and a network interface (NI); together they form a node
A (dedicated) communication network
connects all nodes
Data exchange between nodes: Passing
messages via network (Message Passing)
Variants:
No global (shared) address space: No Remote Memory Access (NORMA)
Non-coherent shared address space (NUMA), e.g. CRAY: PGAS languages (CoArray Fortran, UPC)
Prototype of first PC clusters: Node: single CPU PC; Network: Ethernet
First Massively Parallel Processing (MPP) architectures: CRAY T3D/E, Intel Paragon
Parallel distributed-memory computers: Hybrid system
Standard concept of most modern large parallel computers: Hybrid/hierarchical
Compute node is a 2- or 4-socket shared-memory system with a network interface (NI)
Communication network (GBit Ethernet, InfiniBand) connects the nodes
Price/(peak) performance is optimal; network capability per (peak) performance gets worse
Parallel programming? Pure message passing is standard. Hybrid programming? (see the sketch below)
Today: GPUs / accelerators are added to the nodes, further increasing complexity
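To illustrate what hybrid programming means in practice, here is a minimal MPI+OpenMP sketch. It is an illustration only (not part of the lecture material) and assumes an MPI library with thread support and an OpenMP-capable Fortran compiler:

! Minimal hybrid MPI+OpenMP sketch (illustration only):
! one MPI process per node/socket, several OpenMP threads inside each process
program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, provided, tid

  ! request thread support so that the master thread may call MPI
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

!$omp parallel private(tid)
  tid = omp_get_thread_num()
  print *, 'MPI rank', rank, ' OpenMP thread', tid
!$omp end parallel

  call MPI_Finalize(ierr)
end program hybrid_hello

Message passing is then used between nodes and threading within each shared-memory node; process and thread placement (pinning) determines how well this maps onto the hybrid hardware.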
Networks
What are the basic ideas and performance characteristics of modern networks?
Networks: Basic performance characteristics
Evaluate the network capabilities to transfer data
Use the same idea as for main memory access:
Total transfer time for a message of N Bytes is:
T = T_L + N/B
T_L is the latency (transfer setup time [sec]) and B is the asymptotic (N → ∞) network bandwidth [MBytes/sec]
Consider the simplest case (Ping-Pong):
Two processors in different nodes communicate via the network (point-to-point)
A single message of N Bytes is sent forward and backward
Overall data transfer is 2N Bytes! (see the model sketch below)
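As a worked illustration of the model, the following sketch evaluates T(N) and the resulting effective bandwidth. The latency and bandwidth values are assumed, GigE-like placeholders, not measured data:

! Evaluate the simple latency/bandwidth model:
!   T(N)    = T_L + N/B          (one-way transfer time)
!   Beff(N) = N / (T_L + N/B)    (effective bandwidth seen by Ping-Pong)
program latency_bandwidth_model
  implicit none
  double precision, parameter :: TL = 50.d-6    ! assumed latency [s]
  double precision, parameter :: B  = 111.d6    ! assumed asymptotic bandwidth [bytes/s]
  double precision :: N, T, Beff
  integer :: i

  do i = 0, 20                     ! message sizes from 1 byte to 1 MiB
     N    = 2.d0**i
     T    = TL + N/B
     Beff = N / T
     print *, 'N =', N, '  T [s] =', T, '  Beff [MBytes/s] =', Beff/1.d6
  end do
end program latency_bandwidth_model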
Networks: Basic performance characteristics
Ping-Pong benchmark (schematic view)
myID = get_process_ID()
if(myID.eq.0) then
  targetID = 1
  S = get_walltime()
  call Send_message(buffer,N,targetID)
  call Receive_message(buffer,N,targetID)
  E = get_walltime()
  MBYTES = 2*N/(E-S)/1.d6   ! MBytes/sec rate
  TIME = (E-S)/2*1.d6       ! transfer time in microsecs for single message
else
  targetID = 0
  call Receive_message(buffer,N,targetID)
  call Send_message(buffer,N,targetID)
endif
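In practice the Send_message/Receive_message calls map onto a message-passing library such as MPI. The following is a minimal, self-contained sketch of the same scheme (an illustration with an assumed fixed message size NBYTES; error handling and the loop over message sizes are omitted); it is not the benchmark code used for the measurements on the following slides:

! Minimal MPI Ping-Pong sketch (run with at least 2 processes)
program pingpong
  use mpi
  implicit none
  integer, parameter :: NBYTES = 1048576          ! assumed message size: 1 MiB
  character(len=1)   :: buffer(NBYTES)
  integer            :: myID, targetID, ierr
  integer            :: status(MPI_STATUS_SIZE)
  double precision   :: S, E

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myID, ierr)

  if (myID == 0) then
     targetID = 1
     S = MPI_Wtime()
     call MPI_Send(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, ierr)
     call MPI_Recv(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, status, ierr)
     E = MPI_Wtime()
     print *, 'Beff [MBytes/s]      :', 2.d0*NBYTES/(E-S)/1.d6
     print *, 'T per message [usec] :', (E-S)/2*1.d6
  else if (myID == 1) then
     targetID = 0
     call MPI_Recv(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, status, ierr)
     call MPI_Send(buffer, NBYTES, MPI_CHARACTER, targetID, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program pingpong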
Networks: Basic performance characteristics
Ping-Pong benchmark for GBit-Ethernet (GigE) network
N1/2: message size where 50% of the peak bandwidth is achieved
Beff = 2*N/(E-S)/1.d6
Asymptotic bandwidth: B = 111 MBytes/sec ≈ 0.888 Gbit/s
Latency (N → 0): only qualitative agreement: 44 µs vs. 76 µs
Networks: Basic performance characteristics
Ping-Pong benchmark for DDR InfiniBand (DDR-IB) network
Determine B and T_L independently and combine them
Networks: Basic performance characteristics
First-principles modeling of Beff(N) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small N), may fail because:
Overhead for transmission protocols, e.g. message headers
Minimum frame size for message transmission, e.g. TCP/IP over Ethernet always transfers frames of a minimum size
Message setup/initialization involves multiple software layers and protocols; each software layer adds to the latency; hardware-only latency is often small
As the message size increases, the software may switch to a different protocol, e.g. from eager to rendezvous
Typical message sizes in applications are neither small nor large
The N1/2 value is also important: N1/2 = B * T_L (see the sketch below)
Network balance: relate the network bandwidth (B or Beff(N1/2)) to the compute power (or main memory bandwidth) of the nodes
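A small worked sketch of these two quantities; all parameter values are assumed, roughly IB-like placeholders, not taken from the lecture:

! Illustrative sketch (assumed parameters): N1/2 = B * T_L and a simple
! network balance estimate; values are placeholders, not measurements.
program n_half_and_balance
  implicit none
  double precision, parameter :: TL   = 5.d-6    ! assumed latency [s]
  double precision, parameter :: B    = 2.0d9    ! assumed network bandwidth [bytes/s]
  double precision, parameter :: PEAK = 500.d9   ! assumed node peak performance [flop/s]
  double precision :: Nhalf, balance

  Nhalf   = B * TL            ! message size delivering 50% of peak bandwidth
  balance = B / PEAK          ! bytes transferable per floating-point operation

  print *, 'N1/2 [bytes]               :', Nhalf
  print *, 'Beff(N1/2) [bytes/s]       :', Nhalf / (TL + Nhalf/B)   ! = B/2
  print *, 'Network balance [byte/flop]:', balance
end program n_half_and_balance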
Networks: Topologies & Bisection bandwidth
Network bisection bandwidth Bb is a general metric for the data
transfer capability of a system:
Minimum sum of the bandwidths of all connections cut when
splitting the system into two equal parts
More meaningful metric in terms of
system scalability:
Bisection BW per node: Bb/Nnodes
Bisection BW depends on
Bandwidth per link
Network topology
Uni- or Bi-directional bandwidth?!
Network topologies: Bus
Bus can be used by one
connection at a time
Bandwidth is shared among
all devices
Bisection BW is constant: Bb/Nnodes ~ 1/Nnodes
Collision detection, bus arbitration protocols must be in place
Examples: PCI bus, diagnostic buses
Advantages
Low latency
Easy to implement
Disadvantages
Shared bandwidth, not scalable
Problems with failure resiliency (one defective agent may block bus)
Fast buses for large N require large signal power
Non-blocking crossbar
A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements
This can be used as a 4-port non-blocking switch (fold at the secondary diagonal)
It is built from 2x2 switching elements
Switches can be cascaded to form
hierarchies (common case)
Allows scalable communication at high hardware/energy costs
Crossbars can be used as interconnects in computer systems, e.g. the NEC SX9 vector system (IXS)
Network topologies: Switches and Fat-Trees
Standard clusters are built with switched networks
Compute nodes (devices) are split up in groups; each group is connected to a single (non-blocking crossbar) switch (leaf switch)
Leaf switches are connected with each other using an additional
switch hierarchy (spine switches) or directly (for small configs.)
Switched networks: Distance between any two devices is
heterogeneous (number of hops in switch hierarchy)
Diameter of network: The maximum number of hops required to connect two
arbitrary devices, e.g. diameter of bus=1
Perfect world: Fully non-blocking, i.e. any choice of Nnodes/2
disjoint node (device) pairs can communicate at full speed
Fat tree switch hierarchies
Fully non-blocking:
Nnodes/2 end-to-end connections with full bandwidth B
Bb = B * Nnodes/2
Bb/Nnodes = const. = B/2
Sounds good, but see next slide
Oversubscribed (figure: spine switch with oversubscription factor k=3):
Spine does not support Nnodes/2 full-bandwidth end-to-end connections
Bb/Nnodes = const. = B/(2k), with k the oversubscription factor
Resource management (job placement) is crucial (see the sketch below)
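A short sketch of the two formulas above; the link bandwidth, node count, and oversubscription factor are assumed placeholder values:

! Illustrative sketch (assumed link bandwidth and node count):
! bisection bandwidth of a fat tree, fully non-blocking vs. oversubscribed.
program fat_tree_bisection
  implicit none
  double precision, parameter :: B = 2.5d9      ! assumed bandwidth per link [bytes/s]
  integer, parameter          :: NNODES = 512   ! assumed number of nodes
  integer, parameter          :: K = 3          ! assumed oversubscription factor
  double precision :: Bb_full, Bb_over

  Bb_full = B * NNODES / 2            ! fully non-blocking: Bb = B*Nnodes/2
  Bb_over = Bb_full / K               ! oversubscribed spine: Bb = B*Nnodes/(2k)

  print *, 'Bb/node, non-blocking   [bytes/s]:', Bb_full / NNODES   ! = B/2
  print *, 'Bb/node, oversubscribed [bytes/s]:', Bb_over / NNODES   ! = B/(2k)
end program fat_tree_bisection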
Fat trees and static routing
If all end-to-end data paths are preconfigured (static routing),
not all possible combinations of N agents will get full bandwidth
Example: 1→5, 2→6, 3→7, 4→8 is a collision-free pattern here
Change 2→6, 3→7 to 2→7, 3→6: this pattern has collisions if no other connections are re-routed at the same time (see the sketch below)
Static routing: quasi-standard in commodity interconnects
However, things are starting to improve slowly
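A toy sketch of why static routing causes such collisions, assuming a two-spine fat tree (nodes 1-2 and 3-4 on two leaf switches, destinations 5-8 on two others) and a simple destination-based uplink choice uplink = mod(destination, 2). This is an illustration of the idea only, not the routing algorithm of any real fabric:

! Toy sketch of destination-based static routing in a 2-spine fat tree.
program static_routing_collisions
  implicit none
  integer, parameter :: NCONN = 4
  integer, dimension(NCONN) :: src = (/ 1, 2, 3, 4 /)
  integer, dimension(NCONN) :: dst = (/ 5, 7, 6, 8 /)   ! the colliding pattern from the slide
  integer :: i, j, leaf_i, leaf_j

  do i = 1, NCONN-1
     do j = i+1, NCONN
        leaf_i = (src(i)-1)/2          ! leaf switch of source i (nodes 1-2 -> 0, 3-4 -> 1)
        leaf_j = (src(j)-1)/2
        ! collision: same source leaf switch and same statically chosen uplink (spine)
        if (leaf_i == leaf_j .and. mod(dst(i),2) == mod(dst(j),2)) then
           print *, 'collision:', src(i), '->', dst(i), ' and ', src(j), '->', dst(j)
        end if
     end do
  end do
end program static_routing_collisions

With dst = (/ 5, 6, 7, 8 /) the same check reports no collisions, matching the collision-free pattern above.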
Full fat-tree: Single 288-port IB DDR-Switch
Basic building blocks: 24-port switches
SPINE switch level: 12 switches
LEAF switch level: 24 switches with 24*12 = 288 ports to devices
Each leaf switch uses 12 of its 24 ports for devices and 12 uplinks (one to each spine switch)
In total S = 12 + 24 = 36 switches provide 288 ports
Fat tree networks: Examples
Ethernet
1, 10, and 100 Gbit/s variants
InfiniBand
Dominant high-performance commodity interconnect
DDR: 20 Gbit/s per link and direction (Building blocks: 24-port switches)
QDR: 40 Gbit/s per link and direction
QDR IB is used in RRZE's LiMa and Emmy clusters
Building blocks: 36-port switches; large 36*18 = 648-port switches
FDR-10 / FDR: 40/56 Gbit/s per link and direction
EDR: 100 Gbit/s per link and direction
Intel OmniPath
Up to 100 Gbit/s per link & 48-port baseline switches
Will be used in RRZE's next-generation cluster
Expensive & complex to scale to very high node counts
Meshes
Fat trees can become prohibitively expensive in large systems
Compromise: Meshes
n-dimensional Hypercubes
Toruses (2D / 3D)
Many others (including hybrids)
Each node is a router
Example: 2D torus mesh
Direct connections only between direct neighbors
This is not a non-blocking crossbar!
Intelligent resource management and
routing algorithms are essential
Toruses in very large systems: Cray XE/XK series, IBM Blue Gene
Bb ~ Nnodes^((d-1)/d), i.e. Bb/Nnodes → 0 for large Nnodes (see the sketch below)
Sounds bad, but those machines show good scaling for many codes
Well-defined and predictable bandwidth behavior!
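A short sketch of how Bb/Nnodes decays for a torus; the link bandwidth and sizes are assumed placeholder values, and the factor 2 accounts for the wrap-around links crossing the cut:

! Illustrative sketch (assumed link bandwidth): bisection bandwidth of a
! d-dimensional torus with n nodes per dimension, Nnodes = n**d.
! Cutting the torus in half crosses 2*n**(d-1) links (factor 2: wrap-around).
program torus_bisection
  implicit none
  double precision, parameter :: B = 5.d9   ! assumed bandwidth per link [bytes/s]
  integer, parameter :: D = 3               ! torus dimensionality
  integer :: n, nnodes
  double precision :: Bb

  do n = 4, 32, 4
     nnodes = n**D
     Bb     = 2.d0 * B * dble(n)**(D-1)     ! Bb ~ Nnodes**((d-1)/d)
     print *, 'Nnodes =', nnodes, '  Bb/node [bytes/s] =', Bb/nnodes   ! ~ 2B/n -> 0
  end do
end program torus_bisection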