
W1 Hardware Overview (COMP4300/8300 L2-3: Classical Parallel Hardware)

The document provides an overview of classical parallel hardware, including single processor and multiple processor designs. It discusses how multiple processors can overcome limitations of single processors by processing tasks in parallel. It then describes key aspects of processor performance such as clock speed, floating point operations per second, and how peak performance is calculated for different systems. Pipelining and instruction scheduling techniques like superscalar execution are introduced to increase processor throughput by processing instructions in parallel within the processor. Memory bandwidth is also discussed as a potential limitation for achieving peak floating point performance.


Overview: Classical Parallel Hardware

Review of Single Processor Design
● so we talk the same language
● many things happen in parallel even on a single processor
● identify potential issues that (explicitly) parallel hardware can overcome
● why should we use 2 CPUs instead of doubling the speed of one?

Multiple Processor Design
● Flynn’s taxonomy of parallel computers (SIMD vs MIMD)
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static connectivity
● evaluating static networks
● case study: the NCI Gadi supercomputer

The Processor

Performs (among others):
● floating point operations (flops) - add, mult, division (sqrt maybe!)
● integer and logical operations (and, or, etc.)
● instruction processing (fetch, decoding, etc.)
● our primary focus will be on flops (as required by most scientific applications)
● main performance metric: flops/sec, or just FLOPS

The processor clock orchestrates its operation:
● all ops take a fixed number of clock ticks to complete (latency)
● clock speed is measured in GHz (10⁹ cycles/second) or nsec (10⁻⁹ seconds)
  ■ Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Gadi Intel Xeon Cascade Lake 3.2 GHz (0.31 ns), IBM zEC12 5.5 GHz (0.18 ns)
● clock speed limited by: transistor speed, speed of light, energy consumption, etc.
  ■ (to our knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
  ■ light travels about 1 cm in 3.2 ns, and a chip is a few cm across!
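The GHz and ns figures in that list are two views of the same number: the period in ns is simply 1 divided by the frequency in GHz. A minimal sketch, using only the chips quoted above:

#include <stdio.h>

int main(void) {
    /* clock frequencies quoted on the slide */
    const char *chip[] = { "Apple A8", "Xeon Cascade Lake", "IBM zEC12" };
    double ghz[]       = { 1.4, 3.2, 5.5 };

    for (int i = 0; i < 3; i++) {
        double period_ns = 1.0 / ghz[i];   /* GHz = cycles per ns, so period = 1/GHz ns */
        printf("%-18s %.1f GHz -> %.2f ns per cycle\n", chip[i], ghz[i], period_ns);
    }
    return 0;
}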

Processor Performance

flops/sec   Prefix     Occurrence (as of today)
10³         kilo (k)   very badly written code
10⁶         mega (M)   badly written code
10⁹         giga (G)   single core
10¹²        tera (T)   supercomputer node
10¹⁵        peta (P)   all machines in the Top500 (Nov 22, measured)
10¹⁸        exa (E)    2022!

How is peak flops/sec computed?
● Desktop 2.5 GHz Quad-Core: 4 (cores) × 4 (flops) × 2.5 GHz ≡ 40 Gflops/sec
● Bunyip cluster Pentium III: 96 (nodes) × 2 (sockets) × 1 (core) × 1 (flop) × 550 MHz ≡ 105 Gflops/sec
● NCI Raijin: 3592 (nodes) × 2 (sockets) × 8 (cores) × 8 (flops) × 2.6 GHz ≡ 1.19 Pflops/sec
● NCI Gadi: 3074 (nodes) × 2 (sockets) × 24 (cores) × 16 (flops) × 3.2 GHz ≡ 7.55 Pflops/sec
(a sketch of this arithmetic follows this slide pair)

Illustrating pipelining with an example: Adding Numbers

Consider adding two double precision (8 byte) numbers:

[Figure: IEEE 754 double layout; bit 0 = sign (±), bits 1–11 = Exponent, bits 12–63 = Significand]

Possible steps:
● determine the largest exponent
● normalize the significand of the smaller exponent to the larger
● add the significands
● renormalize the significand and exponent of the result

Let us assume each step takes 1 clock tick, i.e., a latency of 4 ticks per addition (flop).
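Returning to "How is peak flops/sec computed?" above, a minimal sketch of that arithmetic (peak = nodes × sockets × cores × flops-per-cycle × clock), using the figures quoted on the slide:

#include <stdio.h>

/* Peak flops/sec = nodes * sockets * cores * flops-per-cycle * clock (Hz). */
static double peak(double nodes, double sockets, double cores,
                   double flops_per_cycle, double ghz) {
    return nodes * sockets * cores * flops_per_cycle * ghz * 1e9;
}

int main(void) {
    printf("Desktop : %.2e flops/sec\n", peak(1,    1, 4,  4,  2.5));  /* 40 Gflops    */
    printf("Bunyip  : %.2e flops/sec\n", peak(96,   2, 1,  1,  0.55)); /* ~105 Gflops  */
    printf("Raijin  : %.2e flops/sec\n", peak(3592, 2, 8,  8,  2.6));  /* ~1.19 Pflops */
    printf("Gadi    : %.2e flops/sec\n", peak(3074, 2, 24, 16, 3.2));  /* ~7.55 Pflops */
    return 0;
}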

Illustrating pipelining with an example: Adding Numbers (cont.)

[Figure: the four addition steps acting as a pipeline; while X(1) is in step 4, X(2) is in step 3, X(3) in step 2 and X(4) in step 1, with X(5) and X(6) still waiting]

● X(1) takes 4 clock ticks to appear (startup latency); X(2) appears 1 tick after X(1)
● asymptotically achieves 1 result per tick
● the operation (X) is said to be pipelined: the steps in the pipeline are running in parallel
● requires the same op to be applied consecutively to different (independent) data items
  ■ good for “vector operations” (note limitations on chaining output data to input)

Instruction Pipelining (Single Instruction Issue)

● break instructions into k stages that are overlapped in time
● e.g. (k = 5): stages FI = Fetch Instrn., DI = Decode Instrn., FO = Fetch Operand, EX = Execute Instrn., WB = Write Back

[Figure: five-stage pipeline following a branch; the FI–WB stages of the (guessed) instructions overlap until the branch outcome is sure]

● Ideally, one gets k-way asymptotic parallelism (speedup)
● However, it is hard to maximize utilization in practice:
  ■ constrained by dependencies among instructions; the CPU must ensure the result is the same as if there were no pipelining!
  ■ FO & WB stages may involve memory accesses (and may possibly stall the pipeline)
  ■ conditional branch instructions are problematic: a wrong guess may require flushing succeeding instructions from the pipeline and rolling back
● tendency to increase the # of stages (especially acute during the 90s–2000s);
  examples of #stages: UltraSPARC II (9) and III (14), Intel Prescott (31)
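Both slides use the same timing model: with k pipeline stages and n independent operations, the pipelined version needs k + (n − 1) ticks instead of n × k, so the speedup approaches k. A minimal sketch (n = 1000 is an arbitrary choice):

#include <stdio.h>

int main(void) {
    const int k = 4;          /* pipeline stages (ticks per addition, from the slide) */
    const int n = 1000;       /* number of independent additions (arbitrary choice)   */

    int serial    = n * k;        /* no overlap: every add pays the full latency      */
    int pipelined = k + (n - 1);  /* first result after k ticks, then 1 per tick      */

    printf("serial: %d ticks, pipelined: %d ticks, speedup: %.2f\n",
           serial, pipelined, (double)serial / pipelined);
    return 0;
}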


Superscalar execution (Multiple Instruction Issue)

● Simple idea: increase the execution rate by using w ≥ 2 (i.e., multiple) pipelines
● w (mutually independent) instructions are (tentatively) piped in parallel at each cycle
● Ideally it offers kw-way parallelism (recall k is the number of pipeline stages)
● However, a number of extra challenges arise:
  ■ Increased complexity: the HW has to be able to resolve dependencies at runtime before issuing several instructions simultaneously
  ■ Some of the functional units might be shared by the pipelines (aka resource dependencies)
  ■ As a result, instructions to be issued together must have an appropriate “instruction mix”; e.g. UltraSPARC (w = 4): ≤ 2 floating point, ≤ 1 load/store, ≤ 1 branch, ≤ 2 integer/logical
● Some remedies: pipeline feedback, branch prediction + speculative execution, out-of-order execution, compilers (e.g., VLIW processors)

Limitations of Memory System Performance

Consider the DAXPY computation:

  y(i) = y(i) + 1.234 × x(i)

If at its peak the CPU can perform 8 flops/cycle (4 fused mult-adds):
● the memory system must load 8 doubles (x(i) and y(i), 64 bytes) and store 4 doubles (y(i), 32 bytes) each clock cycle
● on a 2 GHz system this implies a memory system able to sustain 128 GB/s of load traffic and 64 GB/s of store traffic
● despite advances in memory technology (e.g., DDR5 SDRAM), memory is not able to pump data at such high rates

Memory latency and bandwidth are critical performance issues:
● caches: reduce latency and provide improved cache-to-CPU bandwidth
● multiple memory banks: improve bandwidth (by parallel access)
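For reference, the DAXPY kernel above is just the loop below. The bandwidth figures follow directly from it: each iteration does 2 flops, loads 2 doubles and stores 1, so retiring 4 iterations (8 flops) per cycle moves 64 B of loads and 32 B of stores per cycle, i.e. 128 GB/s plus 64 GB/s at 2 GHz.

#include <stddef.h>

/* y(i) = y(i) + a * x(i): 2 flops, 2 loads and 1 store of a double per iteration */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}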
Memory Hierarchy

  Main Memory    −→ large, cheap memory; large latency / small bandwidth
       ↓
  Cache          −→ small, expensive memory; lower latency / higher bandwidth
       ↓
  CPU Registers

● memory is partitioned into blocks (cache lines) and mapped to cache lines using a mapping algorithm (e.g., fully associative, direct, n-way associative)
● cache lines are typically 16–128 bytes wide; entire cache lines are fetched from memory, not just one element (why?)
● cache hit (few cycles) / cache miss (large number of cycles)
● try to structure code to use an entire cache line of data before replacement (e.g., blocking strategies in dense matrix-matrix multiplication; see the sketch after this slide pair)

Cache memory is effective because algorithms often use data that:
● was recently accessed from memory (temporal locality)
● was close to other recently accessed data (spatial locality)

Going (Explicitly) Parallel

● performance of a single processor is irremediably limited by clock rate
● clock rate is in turn limited by power consumption, transistor switching time, etc.
● ILP allows multiple instructions at once, but it is limited by dependencies
● many problems are inherently distributed / exhibit potential parallelism

It’s time to go (explicitly) parallel.

Parallel Hardware Overview
● Flynn’s Taxonomy of parallel processors (1966, 1972)
  ■ SISD/SIMD/MISD/MIMD
● message-passing versus shared-address space programming
● UMA versus NUMA shared-memory computers
● dynamic/static networks
● evaluating cost and performance of static networks
● case study: NCI’s Gadi (2020–)
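The "blocking strategies" mentioned on the Memory Hierarchy slide can be sketched as below. This is a generic illustration rather than code from the course; the block size BS = 32 is an arbitrary guess that a real implementation would tune to the cache sizes.

#include <stddef.h>

/* C += A * B for n x n row-major matrices, processed in BS x BS blocks so that
   each block of A, B and C is reused while it is still resident in cache. */
enum { BS = 32 };

void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];          /* reused across the j loop */
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}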

SIMD and MIMD in Flynn’s Taxonomy

SIMD: Single Instruction Multiple Data
● also known as data parallel or vector processors (very popular in the 70s and 80s)
● nowadays comes mainly in the form of SSE co-processing instructions (see the sketch after this slide pair)
● other examples: GPUs; the SPEs on Sony’s PS3 IBM CellBE (2006)
● performs best with structured (regular) computations (e.g., image processing)

MIMD: Multiple Instruction Multiple Data
● examples include: (1) a quad-core PC; (2) the 2×24-core Xeon CPUs on each Gadi node

[Figure: SIMD, a single global control unit driving several CPUs over an interconnect; MIMD, several CPUs each with its own control unit over an interconnect]

MIMD

Most successful model for parallel architectures:
● more general purpose than SIMD; can be built out of off-the-shelf components
● extra burden on the programmer

Some challenges for MIMD machines:
● scheduling: efficient allocation of processors to tasks in a dynamic fashion
● synchronization: prevent processors from accessing the same data simultaneously
● interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network, with processors often dedicated to I/O devices
● overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolving contention for resources
● partitioning: partitioning a computation/algorithm into concurrent tasks might not be trivial and may require algorithm redesign and/or significant programming effort
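As a small illustration of the "SSE co-processing instructions" bullet, the loop below issues one SSE2 instruction per pair of doubles; a sketch assuming an x86 processor with SSE2 and, for brevity, that n is a multiple of 2:

#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* z[i] = x[i] + y[i], two elements per _mm_add_pd: one instruction, multiple data */
void add_sse2(size_t n, const double *x, const double *y, double *z) {
    for (size_t i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(&x[i]);
        __m128d b = _mm_loadu_pd(&y[i]);
        _mm_storeu_pd(&z[i], _mm_add_pd(a, b));
    }
}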

Logical classification of parallel computers

Regardless of how they are physically organized under the hood, from a programmer’s perspective parallel computers can be classified into two broad categories:
● Message-passing (distributed address space) parallel computers
● Shared address space parallel computers

Address Space Organization: Message Passing

● logically organized as multiple processing nodes, each with its own exclusive/private address space
● interaction among programs running on different nodes is accomplished using messages
● messages are used to transfer data, work, and synchronization
● typically implemented in practice by so-called distributed-memory parallel computers (although not necessarily)
● in these computers, (aggregate) memory bandwidth scales linearly with the # of processing nodes
● example: parallelism between “nodes” on the NCI Gadi system
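A minimal message-passing sketch of the model just described (not taken from the slides): each MPI process owns a private address space, and data moves only through explicit messages. It assumes an MPI installation and a run with at least two ranks, e.g. something like mpirun -np 2 ./a.out.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 3.14;
    if (rank == 0)
        /* rank 0's copy of "value" is private; it must be sent explicitly */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}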


Address Space Organization: Shared Address Space

● there is a common, shared data address space
● processes interact by modifying objects stored in this shared address space
● most typically implemented by so-called shared-memory computers
● the simplest implementation is flat or uniform memory access (UMA)
● synchronizing concurrent access to shared data objects and processor-processor communication (to maintain coherence among multiple copies) limits performance
● typically one observes sublinear memory bandwidth scaling with the # of processors
● example: a quad-core laptop

[Figure: cached UMA; processors with private caches share all memory modules through a single interconnect]

Non-Uniform Memory Access (NUMA)

● all memory is still visible to the programmer (shared address space), but some memory accesses take longer than others
● designed to increase aggregate memory bandwidth with the # of processors
● parallel programs should be written such that fast (local) memory accesses are maximized (collocate data and computation accordingly; see the sketch after this slide pair)
● example: within each Gadi node, each socket (i.e., 24-core CPU) is connected to its own memory module, which is faster to access than the other (remote) one

[Figure: cached NUMA; each group of processors and caches has its own local memory modules, and the groups are joined by an interconnect]
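One common way to collocate data and computation on a NUMA node is first-touch initialization; the sketch below assumes a Linux-style first-touch page-placement policy and OpenMP (neither is prescribed by the slides). The idea is to initialize the data with the same parallel loop structure that will later compute on it, so each page lands in the memory local to the socket that touches it first.

#include <stddef.h>

void numa_friendly_init(double *x, size_t n) {
    /* same static schedule as the later compute loops, so pages are placed
       near the threads (and hence the socket) that will actually use them */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = 0.0;               /* the first touch places the page */
}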

Dynamic Connectivity: Bus

● simplest/cheapest network: a shared medium common to all processors
● it is a completely blocking network: a point-to-point communication between a processor and a memory module, or between processors, prevents any other communication
● limited bandwidth scalability (multiple accesses to memory are serialized)
● effective cache utilization can alleviate demands on the bus bandwidth

[Figure: processors with private caches attached to the memory modules over a single shared bus]

Dynamic Connectivity: Crossbar

● employs a 2D grid of switching nodes (complexity grows as O(p²))
● it is a completely non-blocking network: a connection between two processors does not block a connection between any other two processors
● not scalable in terms of complexity and cost

[Figure: a grid of switches connecting each processor-and-memory node to every other one]


Dynamic Connectivity: Multi-staged Networks (e.g. Omega Network)

[Figure: an 8-way Omega network; processors 000–111 on the left reach memories 000–111 on the right through the switching network. The highlighted route is s = 010 (src) to t = 111 (dst), with s ⊕ t = 101]

● consists of log₂(p) stages, with p/2 switches per stage (p = 8 in the figure)
● switches can be configured in two modes: pass-through or crossover
● s and t are the binary representations of source and destination
  ■ processed from most to least significant bit (i.e., left to right)
  ■ route through if the current bits of s and t are the same; otherwise, crossover (a sketch of this rule follows after this slide pair)
● partially blocking network (e.g. consider the communications 000-111 and 110-100 at once)

Static Connectivity: Complete, Mesh, Tree

● completely connected (becomes very complex!)
● linear array/ring, mesh/2D torus
● static trees (all nodes are processors) and dynamic trees (intermediate nodes are switches)

[Figure: example topologies; a completely connected graph, a linear array and ring, a mesh and 2D torus, and trees whose intermediate nodes are processors or switches]
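A minimal sketch of the Omega-network switch-setting rule above, replayed for the slide's example s = 010, t = 111 (s ⊕ t = 101, so the stages are set to crossover, pass-through, crossover):

#include <stdio.h>

int main(void) {
    unsigned s = 2;          /* 010, the source used in the slide's example   */
    unsigned t = 7;          /* 111, the destination                          */
    int stages = 3;          /* log2(p) with p = 8, as in the figure          */

    /* examine bits from most to least significant: equal bits -> pass-through,
       differing bits -> crossover */
    for (int i = stages - 1; i >= 0; i--) {
        unsigned sb = (s >> i) & 1u, tb = (t >> i) & 1u;
        printf("stage %d: %s\n", stages - 1 - i,
               sb == tb ? "pass-through" : "crossover");
    }
    return 0;
}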

Static Connectivity: Hypercube

[Figure: a 4-dimensional hypercube (d = 4, p = 16) with nodes labelled 0000–1111]

● two (and exactly two) processing nodes along each dimension, with d = log₂(p) dimensions (thus p = 2ᵈ processing nodes)
● the number of connections per processor grows as log₂(p)
● recursive construction: a d-hypercube is built by connecting two (d − 1)-hypercubes
● two processing nodes are directly connected IF AND ONLY IF their labels differ in exactly one bit
● the number of links in the shortest path between two processors labelled s and t is the number of bits that are on (i.e., = 1) in the binary representation of s ⊕ t (bitwise XOR), e.g. 3 for 101 ⊕ 010 and 2 for 011 ⊕ 101 (a sketch of this rule follows after the next slide)
● examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, TOFU

Evaluating Static Interconnection Networks #1

Diameter
● the maximum distance between any two processors in the network
● directly determines communication time (latency)

Connectivity
● the multiplicity of paths between any two processors
● high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
● arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  ■ 1 for linear arrays and binary trees
  ■ 2 for rings and 2D meshes
  ■ 4 for a 2D torus
  ■ d for d-dimensional hypercubes
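A minimal sketch of the XOR rule from the Hypercube slide: the shortest-path length between nodes s and t is the number of 1-bits in s ⊕ t.

#include <stdio.h>

/* number of links on the shortest path between hypercube nodes s and t */
static int hypercube_distance(unsigned s, unsigned t) {
    unsigned x = s ^ t;
    int links = 0;
    while (x) {              /* count the set bits of s XOR t */
        links += x & 1u;
        x >>= 1;
    }
    return links;
}

int main(void) {
    printf("%d\n", hypercube_distance(0x5, 0x2));  /* 101 xor 010 = 111 -> 3 links */
    printf("%d\n", hypercube_distance(0x3, 0x5));  /* 011 xor 101 = 110 -> 2 links */
    return 0;
}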


Evaluating Static Interconnection Networks #2

Channel width
● the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth
● bisection width is the minimum number of communication links that have to be removed to partition the network into two equal halves
● bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processors

Cost
● many criteria can be used; we will use the number of communication links or wires required by the network

Summary: Static Interconnection Characteristics

Network                 Diameter              Bisection width   Arc connectivity   Cost (no. of links)
Completely-connected    1                     p²/4              p−1                p(p−1)/2
Binary Tree             2 log₂((p+1)/2)       1                 1                  p−1
Linear array            p−1                   1                 1                  p−1
Ring                    ⌊p/2⌋                 2                 2                  p
2D Mesh                 2(√p − 1)             √p                2                  2(p − √p)
2D Torus                2⌊√p/2⌋               2√p               4                  2p
Hypercube               log₂ p                p/2               log₂ p             (p log₂ p)/2

Note: the Binary Tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this.
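To make the table concrete, the snippet below evaluates a few of its rows for a single value of p (p = 16 is an arbitrary choice; the mesh and torus rows assume p is a perfect square):

#include <math.h>
#include <stdio.h>

int main(void) {
    double p = 16;   /* number of processors, chosen arbitrarily */
    printf("Hypercube : diameter %.0f, bisection %.0f, cost %.0f\n",
           log2(p), p / 2, p * log2(p) / 2);
    printf("2D Mesh   : diameter %.0f, bisection %.0f, cost %.0f\n",
           2 * (sqrt(p) - 1), sqrt(p), 2 * (p - sqrt(p)));
    printf("Ring      : diameter %.0f, bisection 2, cost %.0f\n",
           floor(p / 2), p);
    return 0;
}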

NCI’s Gadi: A Petascale Supercomputer

● 184K cores (dual-socket, 24-core Intel Xeon Platinum 8274 (Cascade Lake), 3.2 GHz) in 4243 compute nodes
● 192 GB memory per node (815 TB total)
● Mellanox InfiniBand HDR interconnect (100 Gb/s, ≈ 60 km of cables)
● interconnects: mesh (cores), full (sockets), Dragonfly+ (nodes)
● ≈ 22 PB Lustre parallel filesystem
● power: 1.5 MW max. load
● cooling systems: 100 tonnes of water
● 24th fastest in the world on debut (June 2020), at 9.3 PFLOPS
  ■ (probably) the fastest file-system in the southern hemisphere
  ■ custom Linux kernel (CentOS 8)
  ■ highly customised PBS Pro scheduler

Further Reading: Parallel Hardware

● The Free Lunch Is Over!
● Ch 1, 2.1–2.4 of Introduction to Parallel Computing
● Ch 1, 2 of Principles of Parallel Programming
