COMP4300/8300 L2-3: Classical Parallel Hardware (2023)
W1: Hardware Overview
● so we talk the same language
● many things happen in parallel even on a single processor
● identify potential issues that (explicitly) parallel hardware can overcome
● why should we use 2 CPUs instead of doubling the speed of one?

● floating point operations (flops): add, mult, division (and maybe sqrt!)
● integer and logical operations (and, or, etc.)
● instruction processing (fetch, decoding, etc.)
● our primary focus will be on flops (as required by most scientific applications)
● main performance metric: flops/sec, or just FLOPS

Multiple Processor Design
● Flynn's taxonomy of parallel computers (SIMD vs MIMD)
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static connectivity
● evaluating static networks
● case study: the NCI Gadi supercomputer

The processor clock orchestrates its operation:
● all ops take a fixed number of clock ticks to complete (latency)
● clock speed is measured in GHz (10^9 cycles/second) or nsec (10^-9 seconds)
■ Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Gadi Intel Xeon Cascade Lake 3.2 GHz (0.31 ns), IBM zEC12 5.5 GHz (0.18 ns); see the conversion below
● clock speed limited by: transistor speed, speed of light, energy consumption, etc.
■ (to our knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
■ in one 3.2 GHz clock cycle (0.31 ns) light travels only about 9 cm, and a chip is a few cm across!
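As a quick check on the cycle times quoted above (my own arithmetic, not on the slides): the clock period is the reciprocal of the clock rate, τ = 1/f, so 1/(1.4 × 10^9 Hz) ≈ 0.71 ns, 1/(3.2 × 10^9 Hz) ≈ 0.31 ns and 1/(5.5 × 10^9 Hz) ≈ 0.18 ns.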
Illustrating pipelining with an example: Adding Numbers

[Figure: elements X(6) ... X(1) flowing through a 4-stage pipeline, from Waiting through steps 1-4 to Done]

● X(1) takes 4 clock ticks to appear (startup latency); X(2) appears 1 tick after X(1)
● asymptotically achieves 1 result per tick
● the operation (X) is said to be pipelined: the steps in the pipeline are running in parallel
● requires the same op applied consecutively to different (independent) data items
■ good for “vector operations” (note the limitations on chaining output data to input)

Instruction Pipelining (Single Instruction Issue)

● break instructions into k stages that are overlapped in time
● e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back

[Figure: a branch instruction followed by speculatively issued ("guess") instructions moving through the FI DI FO EX WB stages one cycle apart, until the branch outcome is known ("sure")]

● Ideally, one gets k-way asymptotic parallelism (speedup; quantified below)
● However, it is hard to maximize utilization in practice:
■ constrained by dependencies among instructions; the CPU must ensure the result is the same as if there were no pipelining!
■ the FO & WB stages may involve memory accesses (and may possibly stall the pipeline)
■ conditional branch instructions are problematic: a wrong guess may require flushing succeeding instructions from the pipeline and rolling back
● tendency to increase the # of stages (especially acute from the 1990s to the 2000s); examples of #stages: UltraSPARC II (9) and III (14), Intel Prescott (31)
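A quick way to quantify these speedup claims (my own worked formula, consistent with the startup-latency and one-result-per-tick observations above): with k pipeline stages and n independent operations,

  T_pipelined = (k + n - 1) ticks    versus    T_serial = k * n ticks
  speedup = k*n / (k + n - 1), which tends to k as n grows

For the k = 4, n = 6 adding-numbers example the speedup is 24/9 ≈ 2.7; for n = 1000 it is ≈ 3.99, close to the ideal 4.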
● Simple idea: increase the execution rate by using w ≥ 2 (i.e., multiple) pipelines
● w (mutually independent) instructions are issued in parallel at each cycle, when possible
● Ideally this offers kw-way parallelism (recall k is the number of pipeline stages)
● However, a number of extra challenges arise:
■ increased complexity: the HW has to resolve dependencies at runtime before issuing several instructions simultaneously
■ some of the functional units might be shared by the pipelines (aka resource dependencies)
■ as a result, instructions to be issued together must have an appropriate ‘instruction mix’, e.g. UltraSPARC (w = 4): ≤ 2 different floating point, ≤ 2 integer/logical, ≤ 1 load/store, ≤ 1 branch
● Some remedies: pipeline feedback, branch prediction + speculative execution, out-of-order execution, compilers (e.g., VLIW processors)

Consider the DAXPY computation (a C sketch follows this slide):

  y(i) = y(i) + 1.234 ∗ x(i)

If at its peak the CPU can perform 8 flops/cycle (4 fused mult-adds):
● the memory system must load 8 doubles (x(i) and y(i) – 64 bytes) and store 4 doubles (y(i) – 32 bytes) each clock cycle
■ on a 2 GHz system this implies a memory system able to sustain 128 GB/s load traffic and 64 GB/s store traffic
● despite advances in memory technology (e.g., DDR5 SDRAM), memory is not able to pump data at such high rates

Memory latency and bandwidth are critical performance issues
● caches: reduce latency and provide improved cache-to-CPU bandwidth
● multiple memory banks: improve bandwidth (by parallel access)
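A minimal C sketch of the DAXPY loop above (the function name and signature are my own illustration, not from the slides); the comments restate the bandwidth arithmetic for the quoted peak rate:

  #include <stddef.h>

  /* y[i] = y[i] + a*x[i]: one fused multiply-add (2 flops) per element.
   * At a peak of 4 fused mult-adds (8 flops) per cycle the core consumes
   * 4 elements per cycle: 4 loads of x plus 4 loads of y = 64 bytes loaded,
   * and 4 stores of y = 32 bytes stored, per cycle.
   * At 2 GHz that is 2e9 * 64 B = 128 GB/s of load traffic and
   * 2e9 * 32 B = 64 GB/s of store traffic; far more than DRAM can sustain. */
  void daxpy(size_t n, double a, const double *x, double *y)
  {
      for (size_t i = 0; i < n; i++)
          y[i] = y[i] + a * x[i];
  }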
Memory Hierarchy

Main Memory −→ large, cheap memory; large latency/small bandwidth
↓
Cache −→ small, expensive memory; lower latency/higher bandwidth
↓
CPU Registers

● memory is partitioned into blocks (cache lines) and mapped to cache lines using a mapping algorithm (e.g., fully associative, direct, n-way set associative)
● cache lines are typically 16-128 bytes wide; entire cache lines are fetched from memory, not just one element (why?)
● cache hit (few cycles) / cache miss (large number of cycles)
● try to structure code to use an entire cache line of data before replacement (e.g., blocking strategies in dense matrix-matrix multiplication; a blocked sketch appears below)

Cache memory is effective because algorithms often use data that:
● was recently accessed from memory (temporal locality)
● was close to other recently accessed data (spatial locality)

Going (Explicitly) Parallel

● performance of a single processor is ultimately limited by its clock rate
● clock rate is in turn limited by power consumption, transistor switching time, etc.
● ILP allows multiple instructions at once, but it is limited by dependencies
● many problems are inherently distributed/exhibit potential parallelism

It’s time to go (explicitly) parallel

Parallel Hardware Overview
● Flynn’s taxonomy of parallel processors (1966, 1972)
■ SISD / SIMD / MISD / MIMD
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static networks
● evaluating cost and performance of static networks
● case study: NCI’s Gadi (2020–)
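To make the blocking strategy mentioned on the Memory Hierarchy slide concrete, here is a minimal C sketch (the tile size BS, the function name, and the assumption of row-major square matrices with n a multiple of BS are mine, not from the slides):

  #define BS 32   /* tile edge; assumed small enough that a few BS x BS tiles fit in cache */

  /* C += A*B for n x n row-major matrices, n a multiple of BS.
   * Working on BS x BS tiles means each cache line brought in from A, B and C
   * is reused many times (temporal locality) and traversed contiguously
   * (spatial locality) before it is evicted. */
  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
          for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS; i++)
              for (int k = kk; k < kk + BS; k++) {
                  double a = A[i*n + k];
                  for (int j = jj; j < jj + BS; j++)
                      C[i*n + j] += a * B[k*n + j];
              }
  }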
MIMD: Multiple Instruction Multiple Data

● examples include: (1) a quad-core PC; (2) the 2x 24-core Xeon CPUs on each Gadi node

[Figure: SIMD: a single global control unit drives an array of CPUs through an interconnect; MIMD: each CPU has its own control unit and connects to the interconnect independently]

● scheduling: efficient allocation of processors to tasks in a dynamic fashion
● synchronization: prevent processors from accessing the same data simultaneously
● interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network - often processors are dedicated to I/O devices
● overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolving contention for resources
● partitioning: partitioning a computation/algorithm into concurrent tasks might not be trivial and may require algorithm redesign and/or significant programming effort
Logical classification of parallel computers

Address Space Organization: Message Passing
Address Space Organization: Shared Address Space

● there is a common shared data address space
● processes interact by modifying objects stored in this shared address space
● most typically implemented by so-called shared-memory computers
● the simplest implementation is flat or uniform memory access (UMA)
● synchronizing concurrent access to shared data objects and processor-processor communication (to maintain coherence among multiple copies) limits performance
● typically one observes sublinear memory bandwidth growth with the # of processors
● example: a quad-core laptop

[Figure: processors with private caches sharing memory through a common interconnect (UMA)]

Non-Uniform Memory Access (NUMA)

● all memory is still visible to the programmer (shared address space), but some memory accesses take longer than others
● designed to increase aggregate memory bandwidth with the # of processors
● parallel programs should be written such that fast (local) memory accesses are maximized (collocate data and computation accordingly; a first-touch sketch appears below)
● example: within each Gadi node, each socket (i.e., 24-core CPU) is connected to its own memory module, which is faster to access than the other (remote) one

[Figure: processors with private caches, each group attached to its own local memory, joined by an interconnect (NUMA)]
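One common way to collocate data and computation on a NUMA node (my own sketch, not prescribed by the slides; it assumes OpenMP and Linux-style first-touch page placement) is to initialise an array with the same parallel loop schedule that later computes on it:

  #include <stdlib.h>
  /* compile with e.g. gcc -fopenmp */

  /* With first-touch placement, each page of 'a' is physically allocated in the
   * memory module local to the socket whose thread first writes it. Using the
   * same static schedule for initialisation and computation keeps most accesses
   * local, avoiding slower remote-socket memory accesses. */
  void numa_friendly(size_t n)
  {
      double *a = malloc(n * sizeof *a);

      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
          a[i] = 0.0;                /* first touch: places pages near each thread */

      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
          a[i] = 2.0 * a[i] + 1.0;   /* mostly local memory traffic */

      free(a);
  }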
Dynamic Connectivity: Bus

● simplest/cheapest network: a shared medium common to all processors
● it is a completely blocking network: a point-to-point communication between a processor and a memory module, or between two processors, prevents any other communication
● limited bandwidth scalability (multiple accesses to memory are serialized)
● effective cache utilization can alleviate demands on the bus bandwidth

[Figure: processors with private caches attached to a shared bus, with the memory modules on the other side of the bus]

Dynamic Connectivity: Crossbar

● employs a 2D grid of switching nodes (complexity grows as O(p^2))
● it is a completely non-blocking network: a connection between two processors does not block a connection between any other two processors
● not scalable in terms of complexity and cost

[Figure: processor-and-memory nodes connected to the memory modules through a 2D grid of crossbar switches]
Dynamic Connectivity: Multi-staged Networks (e.g. Omega Network)

[Figure: processors 000 ... 111 connected to memories 000 ... 111 through a multi-stage switching network (Omega network)]

Static Connectivity: Complete, Mesh, Tree

● completely connected (becomes very complex!)
● linear array/ring, mesh/2D torus

[Figures of the corresponding topologies]
Static Connectivity: Hypercube

[Figure: a 4-dimensional hypercube with nodes labelled by 4-bit binary strings (0100, 0110, 1100, 1110, ...)]

● two (and exactly two) processing nodes along each of the d = log2(p) dimensions (thus p = 2^d processing nodes)
● the number of connections per processor grows as log2(p)
● recursive construction: a d-hypercube is built by connecting two (d − 1)-hypercubes
● two processing nodes are directly connected IF AND ONLY IF their labels differ in exactly one bit
● the number of links in the shortest path between two processors labelled s and t is the number of bits that are on (i.e., = 1) in the binary representation of s ⊕ t (bitwise XOR), e.g. 3 for 101 ⊕ 010 and 2 for 011 ⊕ 101 (see the sketch below)
● examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, TOFU

Evaluating Static Interconnection Networks #1

Diameter
● the maximum distance (number of links on a shortest path) between any two processing nodes

Connectivity
● the multiplicity of paths between any two processors
● high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
● arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
■ 1 for linear arrays and binary trees
■ 2 for rings and 2D meshes
■ 4 for a 2D torus
■ d for d-dimensional hypercubes
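A minimal sketch of the hypercube shortest-path rule above (the function name is mine; __builtin_popcount is a GCC/Clang builtin, so the compiler is an assumption):

  /* Number of links on a shortest path between hypercube nodes s and t:
   * the count of 1-bits in s XOR t, i.e. the dimensions in which the two
   * labels differ and which therefore must each be traversed once. */
  int hypercube_distance(unsigned s, unsigned t)
  {
      return __builtin_popcount(s ^ t);
  }

  /* e.g. hypercube_distance(5, 2) == 3  (101 vs 010)
   *      hypercube_distance(3, 5) == 2  (011 vs 101) */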
Channel width
● the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth
● bisection width is the minimum number of communication links that have to be removed to partition the network into two equal halves
● bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processors

Cost
● many criteria can be used; we will use the number of communication links or wires required by the network

Network                  Diameter           Bisection width   Arc connectivity   Cost (no. of links)
Completely-connected     1                  p^2/4             p − 1              p(p − 1)/2
Binary Tree              2 log2((p + 1)/2)  1                 1                  p − 1
Linear array             p − 1              1                 1                  p − 1
Ring                     ⌊p/2⌋              2                 2                  p
2D Mesh                  2(√p − 1)          √p                2                  2(p − √p)
2D Torus                 2⌊√p/2⌋            2√p               4                  2p
Hypercube                log2 p             p/2               log2 p             (p log2 p)/2

Note: the Binary Tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this.
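As a concrete reading of the table above (my own arithmetic, not from the slides): for p = 64, a hypercube has diameter log2 64 = 6, bisection width 64/2 = 32, arc connectivity 6 and cost (64 · 6)/2 = 192 links, whereas an 8 × 8 2D torus has diameter 2⌊8/2⌋ = 8, bisection width 2 · 8 = 16, arc connectivity 4 and cost 2 · 64 = 128 links; the hypercube buys a smaller diameter and a larger bisection width at a higher link cost.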
NCI’s Gadi: A Petascale Supercomputer

● 184K cores (dual-socket, 24-core Intel Xeon Platinum 8274 (Cascade Lake), 3.2 GHz) in 4243 compute nodes
● 192 GB memory per node (815 TB total)
● Mellanox InfiniBand HDR interconnect (100 Gb/s, ≈ 60 km of cables)
● interconnects: mesh (cores), full (sockets), Dragonfly+ (nodes)
● ≈ 22 PB Lustre parallel filesystem
● power: 1.5 MW max. load
● cooling systems: 100 tonnes of water
● 24th fastest in the world on debut (June 2020) – 9.3 PFLOPS
■ (probably) the fastest file system in the southern hemisphere
■ custom Linux kernel (CentOS 8)
■ highly customised PBS Pro scheduler

Further Reading: Parallel Hardware

● The Free Lunch Is Over!
● Ch 1, 2.1-2.4 of Introduction to Parallel Computing
● Ch 1, 2 of Principles of Parallel Programming