
Introduction to High Performance Computing for Scientists and Engineers
Chapter 4: Parallel Computers
Parallel Computers
✤ World’s fastest supercomputers have always exploited some degree
of parallelism in their hardware
✤ With advent of multicore processors, virtually all computers today
are parallel computers, even desktop and laptop computers
✤ Today’s largest supercomputers have hundreds of thousands of cores
and soon will have millions of cores
✤ Parallel computers require more complex algorithms and
programming to divide computational work among multiple
processors and coordinate their activities
✤ Efficient use of additional processors becomes increasingly difficult
as total number of processors grows (scalability)

2
Flynn’s Taxonomy
Computers can be classified by numbers of instruction
and data streams
✤ SISD: single instruction stream, single data stream

• conventional serial computers


✤ SIMD: single instruction stream, multiple data streams

• vector or data parallel computers


✤ MISD: multiple instruction streams, single data stream

• pipelined computers
✤ MIMD: multiple instruction streams, multiple data streams

• general purpose parallel computers


3
SPMD Programming Style
SPMD (single program, multiple data): all processors
execute same program, but each operates on different
portion of problem data

✤ Easier to program than true MIMD but more flexible than SIMD
✤ Most parallel computers today have MIMD architecture but are
programmed in SPMD style
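
As an illustrative sketch of the SPMD style (mine, not from the slides), the following MPI program in C has every process run the same executable while using its rank to select its own block of the index range; the problem size N and the use of MPI here are assumptions for illustration.

/* SPMD sketch: all processes execute this same program, but each
   sums a different block of the global index range [0, N). */
#include <mpi.h>
#include <stdio.h>

#define N 1000000L   /* hypothetical global problem size */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

    /* Each rank owns a contiguous block of indices. */
    long lo = (long)rank * N / size;
    long hi = (long)(rank + 1) * N / size;

    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++)
        local += (double)i;                 /* work on local portion only */

    /* Combine partial results on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %.0f\n", total);

    MPI_Finalize();
    return 0;
}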

4
Parallel Computer Architectures
Parallel architectural design issues
✤ Processor coordination: synchronous or asynchronous?
✤ Memory organization: distributed or shared?
✤ Address space: local or global?
✤ Memory access: uniform or nonuniform?
✤ Granularity: coarse or fine?
✤ Scalability: additional processors used efficiently?
✤ Interconnection network: topology, switching, routing?

5
Major Architectural Paradigms
Memory organization is fundamental architectural design
choice: How are processors connected to memory?

[Figure: distributed-memory multicomputer (each processor P0 … PN paired with its own memory M0 … MN, processors connected by a network) vs. shared-memory multiprocessor (processors P0 … PN connected through a network to memory modules M0 … MN)]

Can also have hybrid combinations of these


6
Parallel Programming Styles
✤ Shared-memory multiprocessor
• Entire problem data stored in common memory
• Programs do loads and stores from common (and typically
remote) memory
• Protocols required to maintain data integrity
• Often exploit loop-level parallelism using pool of tasks paradigm
✤ Distributed-memory multicomputer
• Problem data partitioned among private processor memories
• Programs communicate by sending messages between processors
• Messaging protocol provides synchronization
• Often exploit domain decomposition parallelism
7
Distributed vs. Shared Memory
                              distributed memory    shared memory
scalability                   easier                harder
data mapping                  harder                easier
data integrity                easier                harder
incremental parallelization   harder                easier
automatic parallelization     harder                easier
8
Shared-Memory Computers
✤ UMA (uniform memory access): same latency and bandwidth for all
processors and memory locations
• sometimes called SMP (symmetric multiprocessor)
• often implemented using bus, crossbar, or multistage network
• multicore processor is typically SMP
✤ NUMA (nonuniform memory access): latency and bandwidth vary
with processor and memory location
• some memory locations “closer” than others, with different access
speeds
• consistency of multiple caches is crucial to correctness
• ccNUMA: cache coherent nonuniform memory access

9
Cache Coherence
✤ In shared memory multiprocessor, same cache line in main memory
may reside in cache of more than one processor, so values could be
inconsistent
✤ Cache coherence protocol ensures consistent view of memory
regardless of modifications of values in cache of any processor
✤ Cache coherence protocol keeps track of state of each cache line
✤ MESI protocol is typical
• M, modified: has been modified, and resides in no other cache
• E, exclusive: not yet modified, and resides in no other cache
• S, shared: not yet modified, and resides in multiple caches
• I, invalid: may be inconsistent, value not to be trusted
10
Cache Coherence
✤ Small systems often implement cache coherence using bus snoop
✤ Larger systems typically use directory-based protocol that keeps track
of all cache lines in system
✤ Coherence traffic can hurt application performance, especially if
same cache line is modified frequently by different processors, as in
false sharing
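
As a sketch of how false sharing arises in practice (my illustration, not from the slides; OpenMP threading and a 64-byte line size are assumptions): two threads repeatedly update distinct counters. With the padding below, each counter occupies its own cache line; deleting the pad field puts both counters in one line, so every update invalidates the other thread's cached copy and coherence traffic slows the loop dramatically.

#include <omp.h>
#include <stdio.h>

#define LINE 64   /* assumed cache-line size in bytes */

/* Padding keeps each counter in its own cache line, avoiding false sharing. */
struct padded { long value; char pad[LINE - sizeof(long)]; };

int main(void) {
    struct padded counters[2] = {{0}, {0}};

    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++)
            counters[t].value++;   /* each thread updates only its own counter */
    }

    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}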

11
Hybrid Parallel Architectures
✤ Most large computers today have hierarchical combination of shared
and distributed memory, with memory shared locally within SMP
nodes but distributed globally across nodes interconnected by
network

12
Communication Networks
✤ Access to remote data requires communication
✤ Direct connections among p processors would require O(p²) wires
and communication ports, which is infeasible for large p
✤ Limited connectivity necessitates routing data through intermediate
processors or switches
✤ Topology of network affects algorithm design, implementation, and
performance

13
Common Network Topologies

[Figures: 1-D mesh, 1-D torus (ring), 2-D mesh, 2-D torus, bus, star, crossbar]

14
Common Network Topologies

[Figures: binary tree, butterfly, hypercubes (0-cube, 1-cube, 2-cube, 3-cube, 4-cube)]

15
Graph Terminology
✤ Graph: pair (V, E), where V is set of vertices or nodes connected by set
E of edges
✤ Complete graph: graph in which any two nodes are connected by an
edge
✤ Path: sequence of contiguous edges in graph
✤ Connected graph: graph in which any two nodes are connected by a
path
✤ Cycle: path of length greater than one that connects a node to itself
✤ Tree: connected graph containing no cycles
✤ Spanning tree: subgraph that includes all nodes of given graph and is
also a tree
16
Graph Models
✤ Graph model of network: nodes are processors (or switches or
memory units), edges are communication links
✤ Graph model of computation: nodes are tasks, edges are data
dependences between tasks
✤ Mapping task graph of computation to network graph of target
computer is instance of graph embedding
✤ Distance between two nodes: number of edges (hops) in shortest path
between them

17
Network Properties
✤ Degree: maximum number of edges incident on any node
• determines number of communication ports per processor
✤ Diameter: maximum distance between any pair of nodes
• determines maximum communication delay between processors
✤ Bisection width: smallest number of edges whose removal splits graph
into two subgraphs of equal size
• determines ability to support simultaneous global communication
✤ Edge length: maximum physical length of any wire
• may be constant or variable as number of processors varies
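
Applying these definitions to a few common topologies with p nodes gives the following standard values (listed here for illustration, not taken from the slides):

• ring (1-D torus): degree 2, diameter ⌊p/2⌋, bisection width 2
• 2-D mesh (√p × √p): degree 4, diameter 2 (√p − 1), bisection width √p
• hypercube (p = 2^d): degree log p, diameter log p, bisection width p/2
• complete graph: degree p − 1, diameter 1, bisection width p²/4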

18
Network Properties

19
Graph Embedding
✤ Graph embedding: φ: Vs → Vt maps nodes in source graph Gs = (Vs, Es)
to nodes in target graph Gt = (Vt, Et)
✤ Edges in Gs mapped to paths in Gt
✤ Load: maximum number of nodes in Vs mapped to same node in Vt
✤ Congestion: maximum number of edges in Es mapped to paths
containing same edge in Et
✤ Dilation: maximum distance between any two nodes φ(u), φ(v) ∈ Vt
such that (u,v) ∈ Es

20
Graph Embedding
✤ Uniform load helps balance work across processors
✤ Minimizing congestion optimizes use of available bandwidth of
network links
✤ Minimizing dilation keeps nearest-neighbor communications in
source graph as short as possible in target graph
✤ Perfect embedding has load, congestion, and dilation 1, but not
always possible
✤ Optimal embedding difficult to determine (NP-complete, in general),
so heuristics used to determine good embedding

21
Graph Embedding Examples
✤ For some important cases, good or optimal embeddings are known

22
Gray Code
✤ Gray code: ordering of integers 0 to 2ⁿ − 1 such that consecutive
members differ in exactly one bit position
✤ Example: binary reflected Gray code of length 16:
0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100,
1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000
23
Computing Gray Code
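
One standard way to compute the binary reflected Gray code is g(i) = i XOR (i >> 1); the short C program below (my sketch, not from the slides) prints the length-16 code listed on the previous slide.

#include <stdio.h>

/* Binary reflected Gray code: member i is i XOR (i >> 1). */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const int n = 4;                          /* 2^4 = 16 code words */
    for (unsigned i = 0; i < (1u << n); i++) {
        unsigned g = gray(i);
        for (int b = n - 1; b >= 0; b--)      /* print n bits, high to low */
            putchar(((g >> b) & 1u) ? '1' : '0');
        putchar('\n');
    }
    return 0;
}

Consecutive lines of output differ in exactly one bit, and the last code word (1000) differs from the first (0000) in one bit as well, so the ordering is cyclic.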

24
Hypercube Embeddings
✤ Visiting nodes of hypercube in Gray code order gives Hamiltonian
cycle, embedding ring in hypercube

✤ For mesh or torus of higher dimension, concatenating Gray codes for
each dimension gives embedding in hypercube
25
Communication Cost
✤ Simple model for time required to send message (move data)
between adjacent nodes: Tmsg = ts + tw L, where

• ts = startup time = latency (time to send message of length 0)


• tw = incremental transfer time per word (bandwidth = 1/tw)
• L = length of message in words
✤ For most real parallel systems, ts >> tw
✤ Caveats
• Some systems treat message of length 0 as special case or may
have minimum message size greater than 0
• Many systems use different protocols depending on message size
(e.g. 1-trip vs. 3-trip)
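
As a quick worked example with hypothetical parameters (not measurements from the text): if ts = 10 µs and tw = 1 ns per word, a 1-word message costs about 10.001 µs, essentially all startup, while a 100,000-word message costs 10 µs + 100 µs = 110 µs, dominated by the bandwidth term. This is why aggregating many small messages into fewer large ones usually pays off when ts >> tw.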

26
Message Routing
✤ Messages sent between nodes that are not directly connected must be
routed through intermediate nodes
✤ Message routing algorithms can be

• minimal or nonminimal, depending on whether shortest path is always taken
• static or dynamic, depending on whether same path is always taken
• deterministic or randomized, depending on whether path is chosen systematically or randomly
• circuit switched or packet switched, depending on whether entire message goes along reserved path or is transferred in segments that may not all take same path
✤ Most regular network topologies admit simple routing schemes that
are static, deterministic, and minimal
27
Message Routing Examples

28
Routing Schemes
✤ Store-and-forward routing: entire message is received and stored
at each node before being forwarded to next node on path, so
Tmsg = (ts + tw L) D, where D = distance in hops
✤ Cut-through (or wormhole) routing: message broken into segments
that are pipelined through network, with each segment
forwarded as soon as it is received, so Tmsg = ts + tw L + th D,
where th = incremental time per hop
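
To make the contrast concrete, a small C sketch (the parameter values are assumptions chosen for illustration) evaluates both formulas for a message of L words crossing D hops:

#include <stdio.h>

/* Store-and-forward: the whole message is retransmitted at every hop. */
double t_store_forward(double ts, double tw, double L, double D) {
    return (ts + tw * L) * D;
}

/* Cut-through (wormhole): segments are pipelined, so the bandwidth term
   is paid once and only the small per-hop charge th scales with distance. */
double t_cut_through(double ts, double tw, double th, double L, double D) {
    return ts + tw * L + th * D;
}

int main(void) {
    double ts = 10e-6, tw = 1e-9, th = 50e-9;   /* hypothetical: 10 us, 1 ns/word, 50 ns/hop */
    double L = 1e5, D = 10;                     /* 100,000-word message over 10 hops */
    printf("store-and-forward: %.1f us\n", 1e6 * t_store_forward(ts, tw, L, D));
    printf("cut-through:       %.1f us\n", 1e6 * t_cut_through(ts, tw, th, L, D));
    return 0;
}

With these numbers store-and-forward takes about 1100 µs versus roughly 110 µs for cut-through, which is why cut-through routing makes communication cost nearly independent of distance for long messages.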

29
Communication Concurrency
✤ For given communication system, it may or may not be possible for
each node to

• send message while receiving another simultaneously on same communication link
• send message on one link while receiving simultaneously on different link
• send or receive, or both, simultaneously on multiple links


✤ Depending on concurrency supported, time required for each step of
communication algorithm is effectively multiplied by appropriate
factor (e.g., degree of network graph)

30
Communication Concurrency
✤ When multiple messages contend for network bandwidth, time
required to send message modeled by Tmsg = ts + tw S L, where S is
number of messages sent concurrently over same communication
link
✤ In effect, each message uses 1/S of available bandwidth

31
Collective Communication
✤ Collective communication: multiple nodes communicating
simultaneously in systematic pattern, such as

• broadcast: one-to-all
• reduction: all-to-one

• multinode broadcast: all-to-all

• scatter/gather: one-to-all/all-to-one
• total or complete exchange: personalized all-to-all

• scan or prefix
• circular shift

• barrier
32
Collective Communication

33
Broadcast
✤ Broadcast: source node sends same message to each of p−1 other
nodes
✤ Generic broadcast algorithm generates spanning tree, with source
node as root

34
Broadcast

35
Broadcast
✤ Cost of broadcast depends on network, for example

• 1-D mesh: Tbcast = (p − 1) (ts + tw L)

• 2-D mesh: Tbcast = 2 (√p − 1) (ts + tw L)

• hypercube: Tbcast = log p (ts + tw L)


✤ For long messages, bandwidth utilization may be enhanced by
breaking message into segments and either

• pipeline segments along single spanning tree, or

• send each segment along different spanning tree having same root

• can also use scatter/allgather
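
In practice the spanning tree is chosen inside the communication library; a minimal MPI sketch (mine, not from the slides) in which rank 0 broadcasts a small buffer to all other processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double params[4] = {0.0, 0.0, 0.0, 0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* source node fills the message */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0; params[3] = 4.0;
    }
    /* Every process calls MPI_Bcast; on return, all ranks hold the
       root's data.  The library builds the spanning tree internally. */
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: params[3] = %g\n", rank, params[3]);
    MPI_Finalize();
    return 0;
}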

36
Reduction
✤ Reduction: data from all p nodes are combined by applying specified
associative operation ⊕ (e.g., sum, product, max, min, logical OR,
logical AND) to produce overall result
✤ Generic reduction algorithm reverses generic broadcast: spanning tree
with destination node as root, partial results combined at each node as
they move from leaves toward root

37
Reduction

38
Reduction
✤ Subsequent broadcast required if all nodes need result of reduction
✤ Cost of reduction depends on network, for example

• 1-D mesh: Tred = (p − 1) (ts + (tw + tc) L)
• 2-D mesh: Tred = 2 (√p − 1) (ts + (tw + tc) L)
• hypercube: Tred = log p (ts + (tw + tc) L)


✤ Time per word for associative reduction operation, tc , is often much
smaller than tw , so is sometimes omitted from performance analyses
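
A minimal MPI sketch of reduction (illustrative, not from the slides): each rank contributes one value and MPI_SUM plays the role of ⊕; MPI_Allreduce folds in the subsequent broadcast mentioned above so every rank receives the result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;          /* this node's contribution */
    double on_root = 0.0, everywhere = 0.0;

    /* all-to-one reduction: result lands only on rank 0 */
    MPI_Reduce(&local, &on_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* reduction plus broadcast in one call: every rank gets the sum */
    MPI_Allreduce(&local, &everywhere, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum on root = %g\n", on_root);
    printf("rank %d sees sum = %g\n", rank, everywhere);
    MPI_Finalize();
    return 0;
}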

39
Multinode Broadcast
✤ Multinode broadcast: each of p nodes sends message to all other nodes
(all-to-all)
✤ Logically equivalent to p broadcasts, one from each node, but
efficiency can often be enhanced by overlapping broadcasts
✤ Total time for multinode broadcast depends strongly on concurrency
supported by communication system
✤ Multinode broadcast need be no more costly than standard broadcast
if aggressive overlapping of communication is supported

40
Multinode Broadcast
✤ Implementation of multinode broadcast in specific networks

• 1D torus (ring): initiate broadcast from each node simultaneously in same direction around ring; completes after p − 1 steps at same cost as single-node broadcast
• 2D or 3D torus: apply ring algorithm successively in each dimension
• hypercube: exchange messages pairwise in each of log p dimensions, with messages concatenated at each stage
✤ Multinode broadcast can be used to implement reduction by
combining messages using associative operation instead of
concatenation, which avoids subsequent broadcast when result
needed by all nodes
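
In MPI terms a multinode broadcast is MPI_Allgather; a minimal sketch (my illustration) in which each rank contributes one value and every rank ends up holding the concatenation of all p contributions:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = 100 + rank;                     /* this node's message */
    int *all = malloc(size * sizeof(int));     /* room for p messages */

    /* all-to-all broadcast: after the call, all[i] holds rank i's value on every rank */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: all[%d] = %d\n", rank, size - 1, all[size - 1]);
    free(all);
    MPI_Finalize();
    return 0;
}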
41
Multinode Reduction
✤ Multinode reduction: each of p nodes is destination of reduction from
all other nodes
✤ Algorithms for multinode reduction are essentially reverse of
corresponding algorithms for multinode broadcast

42
Personalized Communication
✤ Personalized collective communication: each node sends (or receives)
distinct message to (or from) each other node

• scatter: analogous to broadcast, but root sends different message to each other node
• gather: analogous to reduction, but data received by root are concatenated rather than combined using associative operation
• total exchange: analogous to multinode broadcast, but each node exchanges different message with each other node
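
These personalized collectives map directly onto MPI calls; a minimal sketch (illustrative only) of scatter and total exchange, sending one integer per destination:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = 1000 * rank + i;   /* distinct message for each destination */

    int piece = 0;
    /* scatter: root (rank 0) sends sendbuf[i] to rank i */
    MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* total exchange: every rank sends a distinct value to every other rank */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: got %d from scatter, %d from rank 0 via total exchange\n",
           rank, piece, recvbuf[0]);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}

MPI_Gather is the mirror image of MPI_Scatter, with the root collecting one distinct message from each rank instead of distributing them.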

43
Scan or Prefix
✤ Scan (or prefix): given data values x0, x1, . . ., xp−1, one per node, along
with associative operation ⊕, compute sequence of partial results s0,
s1, . . ., sp−1, where sk = x0 ⊕ x1 ⊕ ⋅ ⋅ ⋅ ⊕ xk and sk is to reside on node k,
k = 0, . . ., p − 1
✤ Scan can be implemented similarly to multinode broadcast, except
intermediate results received by each node are selectively combined
depending on sending node's numbering, before being forwarded
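
A minimal MPI sketch of an inclusive scan (mine, not from the slides): with ⊕ = sum and xk = k, rank k ends up holding sk = 0 + 1 + ··· + k.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = rank;   /* x_k = k on node k */
    int s = 0;

    /* inclusive prefix sum: rank k receives x_0 + x_1 + ... + x_k */
    MPI_Scan(&x, &s, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, s);
    MPI_Finalize();
    return 0;
}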

44
Circular Shift
✤ Circular k-shift: for 0 < k < p, node i sends data to node (i + k) mod p
✤ Circular shift implemented naturally in ring network, and by
embedding ring in other networks
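
A minimal sketch of a circular k-shift in MPI (illustrative, with k = 1): each rank sends to (i + k) mod p and receives from (i − k) mod p in a single paired call, which avoids deadlock around the ring.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, k = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dest = (rank + k) % size;          /* node (i + k) mod p */
    int src  = (rank - k + size) % size;   /* node (i - k) mod p */
    int outgoing = rank, incoming = -1;

    /* combined send/receive prevents deadlock on the ring */
    MPI_Sendrecv(&outgoing, 1, MPI_INT, dest, 0,
                 &incoming, 1, MPI_INT, src,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, incoming, src);
    MPI_Finalize();
    return 0;
}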

45
Barrier
✤ Barrier: synchronization point that all processes must reach before
any process is allowed to proceed beyond it
✤ For distributed-memory systems, barrier usually implemented by
message passing, using algorithm similar to all-to-all

• Some systems have special network for fast barriers


✤ For shared-memory systems, barrier usually implemented using
mechanism for enforcing mutual exclusion, such as test-and-set or
semaphore, or with atomic memory operations
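
As a sketch of the shared-memory option built on atomic memory operations (my illustration of a sense-reversing counter barrier using C11 atomics, not a production implementation):

#include <stdatomic.h>
#include <stdbool.h>

/* Sense-reversing centralized barrier for a fixed number of threads. */
typedef struct {
    atomic_int  count;    /* threads still to arrive in this episode */
    atomic_bool sense;    /* flips each time the barrier completes   */
    int         nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads) {
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own local sense flag, initialized to true,
   and passes its address on every call. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    bool my_sense = *local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last thread to arrive: reset the counter and release the others */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, my_sense);
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;   /* spin until the last arrival flips the shared sense */
    }
    *local_sense = !my_sense;   /* prepare for the next barrier episode */
}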

46
